Calculate dissimilarity or Euclidean distance for genlight objects

This function calculates both dissimilarity and Euclidean distances for genlight or snpclone objects.

Usage

bitwise.dist(
  x,
  percent = TRUE,
  mat = FALSE,
  missing_match = TRUE,
  scale_missing = FALSE,
  euclidean = FALSE,
  differences_only = FALSE,
  threads = 0L
)

Arguments

x: a genlight or snpclone object.
percent: logical. Should the distance be represented from 0 to 1? Default set to TRUE. FALSE will return the distance represented as integers from 1 to n where n is the number of loci. This option has no effect if euclidean = TRUE.
mat: logical. If TRUE, a matrix object is returned. Default set to FALSE, which returns a dist object.
missing_match: logical. Determines whether two samples differing by missing data in a location should be counted as matching at that location. Default set to TRUE, which forces missing data to match with anything. FALSE forces missing data to not match with any other information, including other missing data.
scale_missing: logical. If TRUE, comparisons with missing data are scaled up proportionally to the number of columns used by multiplying the value by m / (m - x), where m is the number of loci and x is the number of missing sites. This option matches the behavior of base R's dist() function. Defaults to FALSE.
euclidean: logical. If TRUE, the Euclidean distance will be calculated. Defaults to FALSE.
differences_only: logical. When differences_only = TRUE, the output will reflect the number of different loci. The default setting, differences_only = FALSE, reflects the number of different alleles. Note: this has no effect on haploid organisms since 1 locus = 1 allele. This option is NOT recommended.
threads: The maximum number of parallel threads to be used within this function. A value of 0 (default) will attempt to use as many threads as there are available cores/CPUs. In most cases this is ideal. A value of 1 will force the function to run serially, which may increase stability on some systems. Other values may be specified, but should be used with caution.
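
The scale_missing correction described above can be checked against base R. The sketch below is illustrative only: it uses plain numeric vectors rather than genlight objects, and the variable names are hypothetical. It shows how dist() rescales the sum of squared differences by m / (m - x) before taking the square root; whether bitwise.dist() applies the factor at exactly the same step is assumed from the description above, not verified here.

```r
# Two hypothetical samples scored at m = 5 loci; one site is missing in `a`.
a <- c(0, 1, NA, 2, 1)
b <- c(1, 1, 2, 0, 1)

m <- length(a)                  # total number of loci
x <- sum(is.na(a) | is.na(b))   # sites missing in either sample

# Sum of squared differences over the sites observed in both samples
ss <- sum((a - b)^2, na.rm = TRUE)

# Scale up by m / (m - x), then take the square root
manual <- sqrt(ss * m / (m - x))

# Base R's dist() applies the same correction when it encounters NA values
base <- as.numeric(dist(rbind(a, b)))

all.equal(manual, base)
```

Without the correction, pairs with more missing sites would appear closer together simply because fewer loci contribute to the sum.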

Value

A dist object containing pairwise distances between samples, or a matrix object if mat = TRUE.

Details

The default distance calculated here is quite simple and goes by many names depending on its application. The most familiar name might be the Hamming distance, or the number of differences between two strings.
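
As a rough illustration of this distance, the sketch below counts differences between two hypothetical diploid samples in base R. It assumes genotypes are coded as the number of copies of the second allele (0, 1, or 2) and that the fraction is taken over ploidy times the number of loci. It is not the bitwise implementation used by the package.

```r
# Two hypothetical diploid samples at six bi-allelic loci,
# coded as the count of the second allele (an assumption for illustration).
a <- c(0, 1, 2, 2, 0, 1)
b <- c(0, 2, 0, 2, 1, 1)
ploidy <- 2

# Number of differing alleles (the default, differences_only = FALSE)
allele_diffs <- sum(abs(a - b))

# Number of differing loci (what differences_only = TRUE is described as reporting)
locus_diffs <- sum(a != b)

# Fraction of differing alleles, analogous to percent = TRUE
allele_frac <- allele_diffs / (ploidy * length(a))
```

Here the two samples differ at three loci but by four alleles, because the third locus is homozygous for opposite alleles in the two samples.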

As of poppr version 2.8.0, this function also calculates Euclidean distance, and does so considerably faster and with less memory than the standard dist() function.

Note

This function is optimized for genlight and snpclone objects. This does not mean that it is a catch-all optimization for SNP data. Three assumptions must be met for this function to work:

SNPs are bi-allelic
Samples are haploid or diploid
All samples have the same ploidy

If the user supplies a genind or genclone object, prevosti.dist() will be used for calculation.

Author

Zhian N. Kamvar, Jonah C. Brooks

Examples

set.seed(999)
x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2)
x
#>  /// GENLIGHT OBJECT /////////
#> 
#>  // 10 genotypes,  1,000 binary SNPs, size: 20.6 Kb
#>  0 (0 %) missing data
#> 
#>  // Basic content
#>    @gen: list of 10 SNPbin
#>    @ploidy: ploidy of each individual  (range: 2-2)
#> 
#>  // Optional content
#>    @pop: population of each individual (group size range: 4-6)
#>    @other: a list containing: ancestral.pops 
#> 
# Assess fraction of different alleles
system.time(xd <- bitwise.dist(x, threads = 1L))
#>    user  system elapsed 
#>   0.000   0.000   0.001 
xd
#>         1      2      3      4      5      6      7      8      9
#> 2  0.2230                                                        
#> 3  0.2260 0.2280                                                 
#> 4  0.2250 0.2170 0.2040                                          
#> 5  0.3795 0.3835 0.3795 0.3805                                   
#> 6  0.4035 0.3985 0.4055 0.3975 0.2100                            
#> 7  0.4005 0.3955 0.3935 0.3885 0.2000 0.2200                     
#> 8  0.3880 0.3860 0.3870 0.3960 0.2035 0.2205 0.2135              
#> 9  0.3920 0.4030 0.4080 0.3970 0.2135 0.2125 0.2005 0.2150       
#> 10 0.3935 0.3905 0.4015 0.3885 0.2230 0.2160 0.2120 0.2265 0.2195

# Calculate Euclidean distance
system.time(xdt <- bitwise.dist(x, euclidean = TRUE, scale_missing = TRUE, threads = 1L))
#>    user  system elapsed 
#>       0       0       0 
xdt
#>           1        2        3        4        5        6        7        8
#> 2  23.40940                                                               
#> 3  23.74868 24.04163                                                      
#> 4  23.36664 23.10844 22.31591                                             
#> 5  34.36568 34.82815 34.71311 34.56877                                    
#> 6  35.59494 35.45420 36.09709 35.76311 22.93469                           
#> 7  35.53871 35.17101 35.79106 35.00000 22.31591 23.57965                  
#> 8  34.75629 34.98571 35.12834 35.49648 22.24860 23.08679 23.13007         
#> 9  35.49648 35.88872 36.19392 35.41186 23.04344 23.13007 22.24860 22.93469
#> 10 34.71311 34.68429 35.81899 35.05710 23.57965 23.10844 22.80351 23.81176
#>           9
#> 2          
#> 3          
#> 4          
#> 5          
#> 6          
#> 7          
#> 8          
#> 9          
#> 10 23.51595

# \dontrun{

# This function is more efficient in both memory and speed than dist() for
# calculating Euclidean distance on genlight objects. For example, we can
# observe a clear speed increase when we attempt a calculation on 100k SNPs
# with 10% missing data:

set.seed(999)
mat <- matrix(sample(c(0:2, NA), 
                     100000 * 50, 
                     replace = TRUE, 
                     prob = c(0.3, 0.3, 0.3, 0.1)),
              nrow = 50)
glite <- new("genlight", mat, ploidy = 2)

# Default Euclidean distance 
system.time(dist(glite))
#>    user  system elapsed 
#>   1.712   0.057   1.782 

# Bitwise dist
system.time(bitwise.dist(glite, euclidean = TRUE, scale_missing = TRUE))
#>    user  system elapsed 
#>   0.673   0.005   0.705 

# }