Attempts to fix inconsistent repeat lengths found by test_replen
Arguments
- gid
- replen
a numeric vector of repeat motif lengths.
- e
a number to be subtracted or added to inconsistent repeat lengths to allow for proper rounding.
- fix_some
if TRUE (default) and some inconsistent repeat lengths cannot be fixed by subtracting or adding e, those that can be fixed will still be fixed. If FALSE, the original repeat lengths will not be fixed.
Details
This function is modified from the version used in doi:10.5281/zenodo.13007.
Before being fed into the
algorithm to calculate Bruvo's distance, the amplicon length is divided by
the repeat unit length. Because the amplified primer sequence is attached
to the sequence repeat, this division does not always result in an integer,
so the resulting numbers are rounded. The rounding also protects against
slight mis-calls of alleles. Because we know that
$$\frac{(A - e) - (B - e)}{r}$$ is equivalent to $$\frac{A - B}{r}$$,
we know that the primer sequence will not alter the relationships
between the alleles. Unfortunately, for repeat motifs whose lengths are
powers of 2 (e.g. di- and tetranucleotide repeats), the division can land
exactly on a half. Rounding in R follows the IEC 60559 standard (see
round), which means that any number whose fractional part is exactly 0.5 is
rounded to the nearest even integer, so neighboring alleles can collapse
onto the same value. This function will attempt to alleviate this
problem by adding a very small amount to the repeat length so that division
will not result in a 0.5. If this fails, the same amount will be
subtracted. If neither of these works, a warning will be issued and it is up
to the user to determine if the fault is in the allele calls or the repeat
lengths.
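The rounding behaviour and the add-then-subtract strategy described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: is_consistent() and nudge_replen() are hypothetical helpers, "consistent" is assumed to mean that no two distinct allele sizes round to the same repeat count, and the default e = 1e-5 is chosen for the example.

```r
# R rounds exact halves to the nearest even integer.
round(c(0.5, 1.5, 2.5, 3.5))
#> [1] 0 2 2 4

# Hypothetical helper (assumption): a repeat length is "consistent" when no
# two distinct allele sizes collapse onto the same rounded repeat count.
is_consistent <- function(alleles, replen) {
  anyDuplicated(round(unique(alleles) / replen)) == 0L
}

# Hypothetical helper (assumption): try replen + e first, then replen - e,
# following the order given in the text; warn if neither candidate helps.
nudge_replen <- function(alleles, replen, e = 1e-5) {
  if (is_consistent(alleles, replen)) return(replen)
  for (candidate in c(replen + e, replen - e)) {
    if (is_consistent(alleles, candidate)) return(candidate)
  }
  warning("repeat length could not be fixed by adding or subtracting e")
  replen
}

# Twenty tetranucleotide alleles, each carrying 42 bp of primer sequence.
alleles <- 1:20 * 4 + 42
is_consistent(alleles, 4)      # FALSE: 11.5 and 12.5 both round to 12
fixed <- nudge_replen(alleles, 4)
diff(round(alleles / fixed))   # every step between neighbors is now 1
```

The sketch handles one locus at a time; applying it across loci would mean calling nudge_replen() once per locus with that locus's allele sizes.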
References
Zhian N. Kamvar, Meg M. Larsen, Alan M. Kanaskie, Everett M. Hansen, & Niklaus J. Grünwald. Sudden_Oak_Death_in_Oregon_Forests: Spatial and temporal population dynamics of the sudden oak death epidemic in Oregon Forests. ZENODO, doi:10.5281/zenodo.13007, 2014.
Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015). Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology 105:982-989. doi:10.1094/PHYTO-12-14-0350-FI
Ruzica Bruvo, Nicolaas K. Michiels, Thomas G. D'Souza, and Hinrich Schulenburg. A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7):2101-2106, 2004.
Examples
data(Pram)
(Pram_replen <- setNames(c(3, 2, 4, 4, 4), locNames(Pram)))
#> PrMS6A1 Pr9C3A1 PrMS39A1 PrMS45A1 PrMS43A1
#> 3 2 4 4 4
fix_replen(Pram, Pram_replen)
#> PrMS6A1 Pr9C3A1 PrMS39A1 PrMS45A1 PrMS43A1
#> 3.00000 2.00000 3.99999 4.00000 4.00000
# Let's start with an example of a tetranucleotide repeat motif and imagine
# that there are twenty alleles all 1 step apart:
(x <- 1:20L * 4L)
#> [1] 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80
# These are the true lengths of the different alleles. Now, let's add the
# primer sequence to them.
(PxP <- x + 21 + 21)
#> [1] 46 50 54 58 62 66 70 74 78 82 86 90 94 98 102 106 110 114 118
#> [20] 122
# Now we make sure that x / 4 is equal to 1:20, in which consecutive values
# differ by exactly 1.
x/4
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Now, we divide the sequence with the primers by 4 and see what happens.
(PxPc <- PxP/4)
#> [1] 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5
#> [16] 26.5 27.5 28.5 29.5 30.5
(PxPcr <- round(PxPc))
#> [1] 12 12 14 14 16 16 18 18 20 20 22 22 24 24 26 26 28 28 30 30
diff(PxPcr) # we expect all 1s
#> [1] 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0
# Let's try that again by subtracting a tiny amount from 4
(PxPc <- PxP/(4 - 1e-5))
#> [1] 11.50003 12.50003 13.50003 14.50004 15.50004 16.50004 17.50004 18.50005
#> [9] 19.50005 20.50005 21.50005 22.50006 23.50006 24.50006 25.50006 26.50007
#> [17] 27.50007 28.50007 29.50007 30.50008
(PxPcr <- round(PxPc))
#> [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
diff(PxPcr)
#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1