Calculate dissimilarity or Euclidean distance for genlight objects
Source: R/bitwise.r
This function calculates both dissimilarity and Euclidean distances for genlight or snpclone objects.
Usage
bitwise.dist(
x,
percent = TRUE,
mat = FALSE,
missing_match = TRUE,
scale_missing = FALSE,
euclidean = FALSE,
differences_only = FALSE,
threads = 0L
)
Arguments
- x
a genlight or snpclone object.
- percent
logical. Should the distance be represented from 0 to 1? Default set to TRUE. FALSE will return the distance represented as integers from 1 to n where n is the number of loci. This option has no effect if euclidean = TRUE.
- mat
logical. Return a matrix object. Default set to FALSE, returning a dist object. TRUE returns a matrix object.
- missing_match
logical. Determines whether two samples differing by missing data in a location should be counted as matching at that location. Default set to TRUE, which forces missing data to match with anything. FALSE forces missing data to not match with any other information, including other missing data.
- scale_missing
logical. If TRUE, comparisons with missing data are scaled up proportionally to the number of columns used by multiplying the value by m / (m - x), where m is the number of loci and x is the number of missing sites. This option matches the behavior of base R's dist() function. Defaults to FALSE.
- euclidean
logical. If TRUE, the Euclidean distance will be calculated.
- differences_only
logical. When differences_only = TRUE, the output will reflect the number of different loci. The default setting, differences_only = FALSE, reflects the number of different alleles. Note: this has no effect on haploid organisms since 1 locus = 1 allele. This option is NOT recommended.
- threads
The maximum number of parallel threads to be used within this function. A value of 0 (default) will attempt to use as many threads as there are available cores/CPUs. In most cases this is ideal. A value of 1 will force the function to run serially, which may increase stability on some systems. Other values may be specified, but should be used with caution.
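The m / (m - x) scaling described for scale_missing can be checked with base R alone. The following is a minimal sketch, not a call to bitwise.dist() itself: the vectors a and b are made-up samples, and the code only demonstrates how base R's dist() rescales a comparison when some loci are missing.

```r
# Two hypothetical samples scored at m = 6 loci; NA marks missing data.
a <- c(0, 1, 2, NA, 1, 0)
b <- c(1, 1, 0, 2, NA, 0)

ok <- !is.na(a) & !is.na(b)  # loci observed in both samples
m  <- length(a)              # total number of loci
x  <- sum(!ok)               # loci dropped because of missing data

# Manhattan: the sum of absolute differences is scaled by m / (m - x)
manhattan_scaled <- sum(abs(a[ok] - b[ok])) * m / (m - x)

# Euclidean: the sum of squares is scaled before taking the square root
euclidean_scaled <- sqrt(sum((a[ok] - b[ok])^2) * m / (m - x))

# Base R's dist() applies the same scaling when NAs are present
d_man <- as.numeric(dist(rbind(a, b), method = "manhattan"))
d_euc <- as.numeric(dist(rbind(a, b)))

stopifnot(isTRUE(all.equal(manhattan_scaled, d_man)))
stopifnot(isTRUE(all.equal(euclidean_scaled, d_euc)))
```

Here two of the six loci are unusable, so each sum is multiplied by 6 / 4 before it is reported.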
Details
The default distance calculated here is quite simple and goes by many names depending on its application. The most familiar name might be the Hamming distance, or the number of differences between two strings.
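To make the allele-versus-locus distinction concrete, here is a small base R sketch. It assumes genotypes are coded as the number of copies of the second allele (0, 1, or 2 for diploids) and that differing alleles are counted as the absolute difference in those counts; the vectors g1 and g2 are hypothetical, and the code illustrates the idea rather than the package's internal bitwise implementation.

```r
# Two hypothetical diploid samples at five bi-allelic loci,
# coded as the number of copies of the second allele (0, 1, or 2).
g1 <- c(0, 1, 2, 2, 0)
g2 <- c(0, 2, 0, 2, 1)
ploidy <- 2

allele_diffs <- sum(abs(g1 - g2))  # differing alleles: 0 + 1 + 2 + 0 + 1
locus_diffs  <- sum(g1 != g2)      # loci that differ at all

# Fraction of different alleles, out of ploidy * number of loci
allele_frac <- allele_diffs / (ploidy * length(g1))
```

Under these assumptions the two samples differ at 4 alleles across 3 loci, giving a fraction of 4 / 10 = 0.4. For haploid data the two counts coincide, which is why differences_only has no effect there.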
As of poppr version 2.8.0, this function now also calculates Euclidean distance and is considerably faster and more memory-efficient than the standard dist() function.
Note
This function is optimized for genlight and snpclone objects. This does not mean that it is a catch-all optimization for SNP data. Three assumptions must be met for this function to work:
- SNPs are bi-allelic
- Samples are haploid or diploid
- All samples have the same ploidy
If the user supplies a genind or genclone object, prevosti.dist() will be used for calculation.
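Before converting raw data, it can help to screen it against these assumptions. The sketch below is a hypothetical base R helper (check_assumptions is not part of poppr or adegenet) that works on a samples-by-loci matrix of allele counts and a per-sample ploidy vector. It treats "every count lies between 0 and the sample's ploidy" as a proxy for bi-allelic coding, which is an assumption of this example.

```r
# Hypothetical helper: screens a samples-by-loci matrix of allele counts.
check_assumptions <- function(mat, ploidy) {
  stopifnot(is.matrix(mat), length(ploidy) == nrow(mat))
  same_ploidy <- length(unique(ploidy)) == 1
  supported   <- all(ploidy %in% c(1, 2))
  # Counts must be whole numbers between 0 and the sample's ploidy.
  # The ploidy vector recycles down columns, so it lines up with rows.
  # Missing values (NA) are allowed.
  in_range <- all(is.na(mat) | (mat >= 0 & mat <= ploidy & mat == round(mat)))
  c(same_ploidy = same_ploidy,
    haploid_or_diploid = supported,
    valid_counts = in_range)
}

# Two made-up samples at four loci
geno <- rbind(c(0, 1, 2, NA), c(2, 2, 0, 1))
ok_result  <- check_assumptions(geno, ploidy = c(2, 2))  # all checks pass
bad_result <- check_assumptions(geno, ploidy = c(1, 2))  # mixed ploidy
```

In the second call the ploidies differ and the first sample carries a count of 2 despite being declared haploid, so same_ploidy and valid_counts are both FALSE.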
Examples
set.seed(999)
x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2)
x
#> /// GENLIGHT OBJECT /////////
#>
#> // 10 genotypes, 1,000 binary SNPs, size: 20.6 Kb
#> 0 (0 %) missing data
#>
#> // Basic content
#> @gen: list of 10 SNPbin
#> @ploidy: ploidy of each individual (range: 2-2)
#>
#> // Optional content
#> @pop: population of each individual (group size range: 4-6)
#> @other: a list containing: ancestral.pops
#>
# Assess fraction of different alleles
system.time(xd <- bitwise.dist(x, threads = 1L))
#> user system elapsed
#> 0.000 0.000 0.001
xd
#> 1 2 3 4 5 6 7 8 9
#> 2 0.2230
#> 3 0.2260 0.2280
#> 4 0.2250 0.2170 0.2040
#> 5 0.3795 0.3835 0.3795 0.3805
#> 6 0.4035 0.3985 0.4055 0.3975 0.2100
#> 7 0.4005 0.3955 0.3935 0.3885 0.2000 0.2200
#> 8 0.3880 0.3860 0.3870 0.3960 0.2035 0.2205 0.2135
#> 9 0.3920 0.4030 0.4080 0.3970 0.2135 0.2125 0.2005 0.2150
#> 10 0.3935 0.3905 0.4015 0.3885 0.2230 0.2160 0.2120 0.2265 0.2195
# Calculate Euclidean distance
system.time(xdt <- bitwise.dist(x, euclidean = TRUE, scale_missing = TRUE, threads = 1L))
#> user system elapsed
#> 0 0 0
xdt
#> 1 2 3 4 5 6 7 8
#> 2 23.40940
#> 3 23.74868 24.04163
#> 4 23.36664 23.10844 22.31591
#> 5 34.36568 34.82815 34.71311 34.56877
#> 6 35.59494 35.45420 36.09709 35.76311 22.93469
#> 7 35.53871 35.17101 35.79106 35.00000 22.31591 23.57965
#> 8 34.75629 34.98571 35.12834 35.49648 22.24860 23.08679 23.13007
#> 9 35.49648 35.88872 36.19392 35.41186 23.04344 23.13007 22.24860 22.93469
#> 10 34.71311 34.68429 35.81899 35.05710 23.57965 23.10844 22.80351 23.81176
#> 9
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10 23.51595
# \dontrun{
# This function is more efficient in both memory and speed than dist() for
# calculating Euclidean distance on genlight objects. For example, we can
# observe a clear speed increase when we attempt a calculation on 100k SNPs
# with 10% missing data:
set.seed(999)
mat <- matrix(sample(c(0:2, NA),
                     100000 * 50,
                     replace = TRUE,
                     prob = c(0.3, 0.3, 0.3, 0.1)),
              nrow = 50)
glite <- new("genlight", mat, ploidy = 2)
# Default Euclidean distance
system.time(dist(glite))
#> user system elapsed
#> 1.712 0.057 1.782
# Bitwise dist
system.time(bitwise.dist(glite, euclidean = TRUE, scale_missing = TRUE))
#> user system elapsed
#> 0.673 0.005 0.705
# }