MLG definitions based on genetic distance

Multilocus genotypes are initially defined by naive string matching, but this definition does not take into account missing data or genotyping error, casting these as unique genotypes. Defining multilocus genotypes by genetic distance allows you to incorporate genotypes that have missing data o genotyping error into their parent clusters.

Usage

mlg.filter(
  pop,
  threshold = 0,
  missing = "asis",
  memory = FALSE,
  algorithm = "farthest_neighbor",
  distance = "diss.dist",
  threads = 1L,
  stats = "MLGs",
  ...
)

mlg.filter(
  pop,
  missing = "asis",
  memory = FALSE,
  algorithm = "farthest_neighbor",
  distance = "diss.dist",
  threads = 1L,
  ...
) <- value

Arguments

pop

a genclone, snpclone, or genind object.

threshold

a number indicating the minimum distance two MLGs must be separated by to be considered different. Defaults to 0, which will reflect the original (naive) MLG definition.

missing

any method to be used by missingno: "mean", "zero", "loci", "genotype", or "asis" (default).

memory

whether this function should remember the last distance matrix it generated. TRUE will attempt to reuse the last distance matrix if the other parameters are the same. (default) FALSE will ignore any stored matrices and not store any it generates.

algorithm

determines the type of clustering to be done.

"farthest_neighbor": (default) merges clusters based on the maximum distance between points in either cluster. This is the strictest of the three.
"nearest_neighbor": merges clusters based on the minimum distance between points in either cluster. This is the loosest of the three.
"average_neighbor": merges clusters based on the average distance between every pair of points between clusters.

distance

a character or function defining the distance to be applied to pop. Defaults to diss.dist for genclone objects and bitwise.dist for snpclone objects. A matrix or table containing distances between individuals (such as the output of rogers.dist) is also accepted for this parameter.

threads

(unused) Previously, this was the maximum number of parallel threads to be used within this function. Default is 1 indicating that this function will run serially. Any other number will result in a warning.

stats

a character vector specifying which statistics should be returned (details below). Choices are "MLG", "THRESHOLDS", "DISTANCES", "SIZES", or "ALL". If choosing "ALL" or more than one, a named list will be returned.

...

any parameters to be passed off to the distance method.

value

the threshold at which genotypes should be collapsed.

Value

Default, a vector of collapsed multilocus genotypes. Otherwise, any combination of the following:

MLGs

a numeric vector defining the multilocus genotype cluster of each individual in the dataset. Each genotype cluster is separated from every other genotype cluster by at least the defined threshold value, as calculated by the selected algorithm.

THRESHOLDS

A numeric vector representing the thresholds beyond which clusters of multilocus genotypes were collapsed.

DISTANCES

A square matrix representing the distances between each cluster.

SIZES

The sizes of the multilocus genotype clusters in order.

Details

This function will take in any distance matrix or function and collapse multilocus genotypes below a given threshold. If you use this function as the assignment method (mlg.filter(myData, distance = myDist) <- 0.5), the distance function or matrix will be remembered by the object. This means that if you define your own distance matrix or function, you must keep it in memory to further utilize mlg.filter.

Note

mlg.vector makes use of mlg.vector grouping prior to applying the given threshold. Genotype numbers returned by mlg.vector represent the lowest numbered genotype (as returned by mlg.vector) in in each new multilocus genotype. Therefore mlg.filter and mlg.vector return the same vector when threshold is set to 0 or less.

Examples


data(partial_clone)
pc <- as.genclone(partial_clone, threads = 1L) # convert to genclone object

# Basic Use ---------------------------------------------------------------


# Show MLGs at threshold 0.05
mlg.filter(pc, threshold = 0.05, distance = "nei.dist", threads = 1L)
#>  [1]  8  7 23 24 22 21 10  3 22 11 24  7 25  4 12  2 14  1  7  7  7 26  7 13 23
#> [26]  3 17 22  6 20 22 12  5 25 13 21 15 13 13 13  2 19 18 13 23 16  1 11 25  4
pc # 26 mlgs
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    26 original multilocus genotypes 
#>    50 diploid individuals
#>    10 codominant loci
#> 
#> Population information:
#> 
#>     0 strata. 
#>     4 populations defined - 1, 2, 3, 4

# Set MLGs at threshold 0.05
mlg.filter(pc, distance = "nei.dist", threads = 1L) <- 0.05
pc # 25 mlgs
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    25 contracted multilocus genotypes
#>       (0.05) [t], (nei.dist) [d], (farthest) [a] 
#>    50 diploid individuals
#>    10 codominant loci
#> 
#> Population information:
#> 
#>     0 strata. 
#>     4 populations defined - 1, 2, 3, 4

# \dontrun{

# The distance definition is persistant
mlg.filter(pc) <- 0.1
pc # 24 mlgs
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    24 contracted multilocus genotypes
#>       (0.1) [t], (nei.dist) [d], (farthest) [a] 
#>    50 diploid individuals
#>    10 codominant loci
#> 
#> Population information:
#> 
#>     0 strata. 
#>     4 populations defined - 1, 2, 3, 4

# But you can still change the definition
mlg.filter(pc, distance = "diss.dist", percent = TRUE) <- 0.1
pc
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    25 contracted multilocus genotypes
#>       (0.1) [t], (diss.dist) [d], (farthest) [a] 
#>    50 diploid individuals
#>    10 codominant loci
#> 
#> Population information:
#> 
#>     0 strata. 
#>     4 populations defined - 1, 2, 3, 4

# Choosing a threshold ----------------------------------------------------


# Thresholds for collapsing multilocus genotypes should not be arbitrary. It
# is important to consider what threshold is suitable. One method of choosing
# a threshold is to find a gap in the distance distribution that represents
# clonal groups. You can look at this by analyzing the distribution of all
# possible thresholds with the function "cutoff_predictor".

# For this example, we'll use Bruvo's distance to predict the cutoff for
# P. infestans.

data(Pinf)
Pinf
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    72 multilocus genotypes 
#>    86 tetraploid individuals
#>    11 codominant loci
#> 
#> Population information:
#> 
#>     2 strata - Continent, Country
#>     2 populations defined - South America, North America
# Repeat lengths are necessary for Bruvo's distance
(pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)))
#>    Pi02     D13    Pi33    Pi04    Pi4B    Pi16     G11    Pi56    Pi63    Pi70 
#> 2.00000 2.00000 6.00000 2.00000 1.99999 2.00000 2.00000 2.00000 3.00000 3.00000 
#>    Pi89 
#> 1.99999 

# Now we can collect information of the thresholds. We can set threshold = 1
# because we know that this will capture the maximum possible distance:
(thresholds <- mlg.filter(Pinf, distance = bruvo.dist, stats = "THRESHOLDS",
                          replen = pinfreps, threshold = 1))
#>  [1] 0.01262626 0.02189867 0.02272727 0.02272727 0.03535354 0.04166667
#>  [7] 0.04261364 0.04545454 0.04699337 0.04758520 0.05681818 0.05681818
#> [13] 0.05835701 0.06451231 0.06534091 0.06818182 0.07871686 0.07954545
#> [19] 0.08772748 0.08877841 0.09375000 0.09469697 0.09722222 0.10144413
#> [25] 0.12500000 0.13099747 0.13593750 0.13740234 0.14488636 0.15000000
#> [31] 0.15656566 0.16688366 0.18115234 0.19317072 0.21022726 0.21590909
#> [37] 0.21874983 0.22561553 0.23115234 0.23295450 0.23437499 0.23532197
#> [43] 0.24147723 0.25850032 0.27201702 0.27500000 0.28338066 0.29208984
#> [49] 0.30000000 0.30042336 0.30326990 0.30549961 0.30632250 0.31325684
#> [55] 0.33639034 0.33709162 0.34089797 0.35170815 0.35619658 0.36079233
#> [61] 0.36487924 0.39646464 0.40280346 0.40332030 0.42325106 0.43731965
#> [67] 0.45436164 0.47047212 0.50827603 0.51496688 0.57474140
# We can use these thresholds to find an appropriate cutoff
(pcut <- cutoff_predictor(thresholds))
#> [1] 0.1132221
mlg.filter(Pinf, distance = bruvo.dist, replen = pinfreps) <- pcut
Pinf
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    48 contracted multilocus genotypes
#>       (0.113) [t], (bruvo.dist) [d], (farthest) [a] 
#>    86 tetraploid individuals
#>    11 codominant loci
#> 
#> Population information:
#> 
#>     2 strata - Continent, Country
#>     2 populations defined - South America, North America

# This can also be visualized with the "filter_stats" function.

# Special case: threshold = 0 ---------------------------------------------


# It's important to remember that a threshold of 0 is equal to the original
# MLG definition. This example will show a data set that contains genotypes
# with missing data that share all alleles with other genotypes except for 
# the missing one.

data(monpop)
monpop # 264 mlg
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    264 multilocus genotypes 
#>    694 haploid individuals
#>     13 codominant loci
#> 
#> Population information:
#> 
#>      1 stratum - Pop
#>     12 populations defined - 
#> 7_09_BB, 26_09_BB, 26_09_FR, ..., 45_10_FR, 26_11_BB, 26_11_FR
mlg.filter(monpop) <- 0
nmll(monpop) # 264 mlg
#> [1] 264

# In order to merge these genotypes with missing data, we should set the 
# threshold to be slightly higher than 0. We will use the smallest fraction 
# the computer can store.

mlg.filter(monpop) <- .Machine$double.eps ^ 0.5
nmll(monpop) # 236 mlg
#> [1] 236

# Custom distance ---------------------------------------------------------

# Custom genetic distances can be used either in functions from other
# packages or user-defined functions

data(Pinf)
Pinf
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    48 contracted multilocus genotypes
#>       (0.113) [t], (bruvo.dist) [d], (farthest) [a] 
#>    86 tetraploid individuals
#>    11 codominant loci
#> 
#> Population information:
#> 
#>     2 strata - Continent, Country
#>     2 populations defined - South America, North America
mlg.filter(Pinf, distance = function(x) dist(tab(x))) <- 3
Pinf
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    48 contracted multilocus genotypes
#>       (3) [t], (function(x) dist(tab(x))) [d], (farthest) [a] 
#>    86 tetraploid individuals
#>    11 codominant loci
#> 
#> Population information:
#> 
#>     2 strata - Continent, Country
#>     2 populations defined - South America, North America
mlg.filter(Pinf) <- 4
Pinf
#> 
#> This is a genclone object
#> -------------------------
#> Genotype information:
#> 
#>    51 contracted multilocus genotypes
#>       (4) [t], (function(x) dist(tab(x))) [d], (farthest) [a] 
#>    86 tetraploid individuals
#>    11 codominant loci
#> 
#> Population information:
#> 
#>     2 strata - Continent, Country
#>     2 populations defined - South America, North America

# genlight / snpclone objects ---------------------------------------------


set.seed(999)
gc <- as.snpclone(glSim(100, 0, n.snp.struc = 1e3, ploidy = 2))
gc # 100 mlgs
#>  ||| SNPCLONE OBJECT |||||||||
#> 
#>  || 100 genotypes,  1,000 binary SNPs, size: 183.5 Kb
#>  0 (0 %) missing data
#> 
#>  || Basic content
#>    @gen: list of 100 SNPbin
#>    @mlg: 100 original multilocus genotypes
#>    @ploidy: ploidy of each individual  (range: 2-2)
#> 
#>  || Optional content
#>    @pop: population of each individual (group size range: 50-50)
#>    @other: a list containing: ancestral.pops 
#> 
#> NULL
mlg.filter(gc) <- 0.25
gc # 82 mlgs
#>  ||| SNPCLONE OBJECT |||||||||
#> 
#>  || 100 genotypes,  1,000 binary SNPs, size: 183.5 Kb
#>  0 (0 %) missing data
#> 
#>  || Basic content
#>    @gen: list of 100 SNPbin
#>    @mlg: 82 contracted multilocus genotypes
#>          (0.25) [t], (bitwise.dist) [d], (farthest) [a]
#>    @ploidy: ploidy of each individual  (range: 2-2)
#> 
#>  || Optional content
#>    @pop: population of each individual (group size range: 50-50)
#>    @other: a list containing: ancestral.pops 
#> 
#> NULL

# }