Introduction

This primer provides a concise introduction to conducting applied analyses of population genetic data in R, with a special emphasis on non-model populations including clonal or partially clonal organisms. It is not meant to be a textbook on population genetics. Quite to the contrary, we refer the reader to several textbooks on population genetics (Templeton, 2006; Hartl & Clark, 2007; Nielsen & Slatkin, 2013). Likewise, this book will not replace books on theory and statistics of population genetics (Weir, 1996). The reader is thus expected to have a basic understanding of population genetic theory and applications. Finally, this primer is focused on traditional population genetics based on allele frequencies, rather than more sophisticated coalescent approaches (Hein, Schierup & Wiuf, 2004; Wakeley, 2009; Nielsen & Slatkin, 2013), although some of the material covered here will apply. In a nutshell, this primer provides a valuable resource for tackling the nitty-gritty analysis of populations that do not necessarily conform to textbook genetics and might or might not be in Hardy-Weinberg equilibrium and panmixia.

This primer is geared towards biologists interested in analyzing their populations. This primer does not require extensive knowledge of programming in R, but the user is expected to install R and all packages required for this primer.

Why use R?

Until recently, one of the more tedious aspects of conducting a population genetic analysis was the need for repeated reformatting data to conduct different, complimentary analyses in different programs. Often, these programs only ran on one platform. Now, R provides a toolbox with its packages that allows analysis of most data conveniently without tedious reformatting on all major computing platforms including Microsoft Windows, Linux, and Apple’s OS X. R is an open source statistical programming and graphing language that includes tools for statistical, population genetic, genomic, phylogenetic, and comparative genomic analyses. This primer is for those of us that want to conduct applied analyses of populations and make full use of the palette of tools available in R including for example the R markdown code for this primer.

Note that the R user community is very active and that both R and its packages are regularly updated, critically modified, and noted as deprecated (no longer updated) as appropriate.

Any R user needs to make sure all components are up-to-date and that versions are compatible.

Population genetics

Traditional population genetics is based on analysis of observed allele frequencies compared to frequencies expected, assuming a population genetic model. For example, under a Wright-Fisher model you might expect to see populations of diploid individuals that reproduce sexually, with non-overlapping generations. This model ignores effects such as mutation, recombination, selection or changes in population size or structure. More complex models can incorporate different aspects of effects observed in real populations. However, most of these models assume that populations reproduce sexually.

Here, we briefly review some terminology but assume that the reader has a basic understanding of theory and applications of population genetics.

A locus is a position in the genome where we can observe one or several alleles in different individuals. Loci used in population genetics are assumed to be selectively neutral and can be an anonymous or non-coding region such as a microsatellite locus (SSR), a single nucleotide polymorphism (SNP) or the presence/absence of a band on a gel. A genotype is the combination of alleles carried by a given individual at a particular set of loci. Individuals carrying the same set of alleles are considered to have the same multilocus genotype \(MLG\).

Basic metrics of interest in populations are polymorphisms, allele frequencies and genotype frequencies. Polymorphism can be estimated in several ways, such as the total number of loci observed that have more than one allele. The frequency of an allele is calculated as the number of allele copies observed in a population divided by the total number of individuals, times the ploidy (\(N\) in haploids or \(2N\) in diploids) in the population. For diploids we would observe frequency \(f_{A}\) and \(f_{a}\) for alleles \(A\) and \(a\), respectively:

\(f_{A} = \frac{N_{A}}{2N}\) and \(f_{a} = \frac{N_{a}}{2N}\)

where \(N_{A}\) are the numbers of allele \(A\) segregating in the population genotyped.

By definition, at a biallelic locus frequencies sum to 1:

\(f_{A} + f_{a} = 1\).

Genotype frequencies are the relative frequencies of each \(MLG\) observed in a population. Thus, for diploids at a biallelic locus we can observe three genotypes: \(AA\), \(Aa\), and \(aa\) and their respective frequencies:

\(f_{AA} = \frac{N_{AA}}{2N}\) and \(f_{Aa} = \frac{N_{Aa}}{2N}\) and \(f_{aa} = \frac{N_{aa}}{2N}\)

These frequencies again add up to 1. In diploid organisms we can also calculate the frequency of homozygotes (\(AA\) or \(aa\)) or heterozygotes (\(Aa\)) as the proportion of individuals falling into each class. The proportion of individuals that are heterozygous is given by \(f_{Aa}\) while the homozygous proportion is given by \(1-f_{Aa} = f_{AA}+f_{aa}\).

An important aspect of population structure is the heterozygosity observed within subdivided populations \(H_{S}\) and across total populations \(H_{T}\). If we assume presence of two populations in Hardy-Weinberg equilibrium then frequencies of Allele \(A\) in populations 1 and 2 pooled over both populations would be:

\(f_{A} = \frac{2N_{1}f_{A1} + 2N_{2}f_{A2}}{2N_{1} + 2N_{2}}\)

where each allele frequency is multiplied by population size \(N_{1}\) and \(N_{2}\) and divided by total number of alleles \(2N_{1} + 2N_{2}\). Remember, that in a diploid population there are 2 alleles per locus per individual and hence every frequency is multiplied by 2.

Genetic differentiation of two populations (e.g. subdivision of populations) occurs if we observe differences in polymorphism, allele frequencies, genotype frequencies, and heterozygosity. Wright’s \(F_{st}\) is a commonly used measure of differences in heterozygosity observed in the overall population relative to the subdivided populations and is defined as the differences between \(H_{T}\) and \(H_{S}\) relative to \(H_{T}\):

\(F_{ST} = \frac{H_{T}-H_{S}}{H_{T}}\)

If allele frequencies are identical in both subpopulations then \(H_{T} = H_{S}\). If, on the other hand, allele frequencies vary between subdivided populations, then \(H_{T} > H_{S}\) and populations are considered to be genetically differentiated.

Besides calculation of \(F_{st}\) other approaches can be used to assess population structure including clustering. These approaches will be explored in subsequent chapters.

The special case of clonal organisms

Plants and microorganisms such as fungi, oomycetes, and bacteria often reproduce clonally or follow a mixed reproductive system, including various degrees of clonality and sexuality (Halkett, Simon & Balloux, 2005). Traditional population genetic theory does not apply and these populations violate basic assumptions in the corresponding analyses. Thus, a different set of tools are required for analyses of clonal populations (Anderson & Kohn, 1995; Milgroom, 1996; Balloux, Lehmann & Meeûs, 2003; Halkett et al., 2005; De Meeûs, Lehmann & Balloux, 2006; Grünwald & Goss, 2011). Typically these focus on metrics that are model free and do not assume random mating or random sampling of alleles. Model free metrics suitable for analyzing these populations include measures of genotypic diversity or evenness, as well as cluster analysis based on genetic distance. Additional useful measures include indices of linkage disequiblibrium among markers that are used to infer if populations are reproducing sexually or clonally. We will explore these analyses bit by bit throughout this primer. The poppr R package was specifically developed to facilitate analyses of clonal populations (Kamvar, Tabima & Grünwald, 2014; Kamvar, Brooks & Grünwald, 2015).

Marker systems, ploidy and other kinds of data

Population genetic data come in many shapes and forms and a good analysis needs to be tailored to the marker system, ploidy used and corresponding genetic models assumed (Grünwald & Goss, 2011). Organisms can be haploid, diploid, with or without known phase, or polyploid. In the case of diploid species, marker systems can be dominant (AFLP, RFLP or RAPD data) or co- dominant (SSR, SNPs, allozymes) (Grünwald et al., 2003). A locus can have two alleles (0/1 for AFLP; C/T for SNPs) or many alleles per locus (A/B/C…N for SSRs or allozymes). In each case the data has to be coded and analyzed differently. Mutation rates can also differ for different markers systems (SSR >> mitochondrial SNPs > nuclear SNPs) (Grünwald & Goss, 2011). All these aspects of a particular data set need to be considered in the data analyses.

Applications of population genetic analyses

Population genetic analyses are tremendously valuable for answering questions ranging from applied to basic evolutionary questions (Grünwald & Goss, 2011). For example, a typical concern when finding a new, invasive organism is whether it is introduced to the area or emerged from a resident population. Other questions of interest might include:

Where is the center of origin?
Does this organism reproduce sexually?
Has the population gone through a genetic bottleneck?
Are populations structured by region, geography, micro-environment?
What are source and sink populations and what are the rates of migration?

This primer will provide some of the basic tools needed to answer many of these questions for populations that are clonal or partially clonal. Note, however, that more powerful, complementary coalescent methods will not be covered here. We hope you enjoy following along.

References

Anderson JB., Kohn LM. 1995. Clonality in soilborne, plant-pathogenic fungi. Annual Review of Phytopathology 33:369–391. Available at: http://www.annualreviews.org/doi/abs/10.1146/annurev.py.33.090195.002101

Balloux F., Lehmann L., Meeûs T de. 2003. The population genetics of clonal and partially clonal diploids. Genetics 164:1635–1644. Available at: http://www.genetics.org/content/164/4/1635.full

De Meeûs T., Lehmann L., Balloux F. 2006. Molecular epidemiology of clonal diploids: A quick overview and a short diy (do it yourself) notice. Infection, Genetics and Evolution 6:163–170. Available at: http://www.ncbi.nlm.nih.gov/pubmed/16290062

Grünwald NJ., Goodwin SB., Milgroom MG., Fry WE. 2003. Analysis of genotypic diversity data for populations of microorganisms. Phytopathology 93:738–746. Available at: http://apsjournals.apsnet.org/doi/abs/10.1094/PHYTO.2003.93.6.738

Grünwald NJ., Goss EM. 2011. Evolution and population genetics of exotic and re-emerging pathogens: Novel tools and approaches. Annual Review of Phytopathology 49:249–267. Available at: http://www.annualreviews.org/doi/abs/10.1146/annurev-phyto-072910-095246?journalCode=phyto

Halkett F., Simon J-C., Balloux F. 2005. Tackling the population genetics of clonal and partially clonal organisms. Trends in Ecology & Evolution 20:194–201. Available at: http://www.ncbi.nlm.nih.gov/pubmed/16701368

Hartl D., Clark A. 2007. Principles of population genetics. Sinauer Associates, Incorporated. Available at: http://books.google.com/books?id=SB1vQgAACAAJ

Hein J., Schierup M., Wiuf C. 2004. Gene genealogies, variation and evolution: A primer in coalescent theory. Oxford University Press, USA. Available at: http://books.google.com/books?id=QBC\_SFOamksC

Kamvar ZN., Brooks JC., Grünwald NJ. 2015. Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Name: Frontiers in Genetics 6:208. Available at: http://dx.doi.org/10.3389/fgene.2015.00208

Kamvar ZN., Tabima JF., Grünwald NJ. 2014. \(Poppr\): An R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ 2:e281. Available at: http://dx.doi.org/10.7717/peerj.281

Milgroom MG. 1996. Recombination and the multilocus structure of fungal populations. Annual Review of Phytopathology 34:457–477. Available at: http://www.annualreviews.org/doi/abs/10.1146/annurev.phyto.34.1.457

Nielsen R., Slatkin M. 2013. An introduction to population genetics: Theory and applications. Sinauer Associates, Incorporated. Available at: http://books.google.com/books?id=Iy08kgEACAAJ

Templeton A. 2006. Population genetics and microevolutionary theory. Wiley. Available at: http://books.google.com/books?id=tJVreQjInt0C

Wakeley J. 2009. Coalescent theory: An introduction. Roberts & Company Publishers. Available at: http://books.google.com/books?id=x30RAgAACAAJ

Weir B. 1996. Genetic data analysis 2. Sinauer Associates. Available at: http://books.google.com/books?id=e9QPAQAAMAAJ