Analysis of Microbiome Community Data in R

Welcome! This primer provides a concise introduction to conducting applied analyses of microbiome data in R. While this primer does not require extensive knowledge of programming in R, the user is expected to install R and all packages required for this primer.

Please install the required software and download the example data before coming to the workshop.

What is covered

This primer provides a concise introduction to the statistical analysis and visualization in R of microbiome data generated by metabarcoding and high-throughput sequencing (HTS). It does not cover “shotgun” metagenomic analysis, which is very different in nature. The reader is expected to have a very basic understanding of ecological diversity theory and some experience with R. The techniques presented here assume the raw sequences have been converted to exact sequence variants (ESVs) or operational taxonomic units (OTUs) and classified (i.e., assigned a taxonomy) using tools such as QIIME, mothur, or dada2 (Schloss et al. 2009; Caporaso et al. 2010; Callahan et al. 2016).

Why use R?

R is an open source (free) statistical programming and graphing language that includes tools for statistical analysis and for the analysis of ecological diversity and community data, among many other things. R provides a cohesive environment to analyze data using modular “toolboxes” called R packages. R runs on all major operating systems including Microsoft Windows, Linux (e.g., Ubuntu), and Apple’s OS X. The general type of analyses done in this workshop could also be done in Python, Perl, or with command-line tools. We like R for the following reasons:

  • We use it (i.e., we are biased).
  • R packages are easy to install and not too hard to make.
  • The R community is very active and growing. Packages are updated frequently.
  • Repositories such as the Comprehensive R Archive Network (CRAN) and Bioconductor provide some quality control of packages and make them easy to install.
  • RStudio is a great, free graphical user interface.
  • R Markdown is well supported, allowing R code to be embedded in documents and output to diverse formats. This website is the output of a set of R Markdown documents. Thus, R Markdown can be used as an electronic notebook that facilitates reproducible research.
  • R has strong graphing and statistical capabilities.
  • You can produce publication-ready graphics.
  • R is designed to be an interactive language, providing a natural fit for statistical analyses rather than writing programs.

The kind of data used in this workshop

This workshop will not start with the raw reads, since the first steps in a metabarcoding workflow are typically done in the cloud using command-line tools such as QIIME or mothur (dada2 is an exception). Data that can be analysed using the techniques presented here are typically the result of the following steps (Comeau, Douglas, and Langille 2017):

  1. Sample environments/soil/tissue/water and extract DNA.
  2. Perform PCR using standard primers.
  3. Sequence using a high-throughput sequencing platform such as the Illumina MiSeq.
  4. Call OTUs/ESVs and assign a taxonomic classification by comparing them to a reference database, such as Greengenes.
  5. Construct an abundance matrix of read counts for each OTU in each sample.
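
As a minimal sketch of what step 5 produces, the base R snippet below builds a small abundance matrix by hand and summarizes it. The OTU names, sample names, and read counts are invented for illustration and do not come from the workshop data; the orientation (OTUs in rows, samples in columns) is also an assumption, since some tools use the transpose.

```r
# Hypothetical read counts: rows are OTUs, columns are samples.
# All names and numbers are made up for illustration.
counts <- matrix(
  c(120,  0, 35,
     15, 60,  0,
      0, 22, 98,
     40, 18,  7),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("OTU_1", "OTU_2", "OTU_3", "OTU_4"),
    c("sample_A", "sample_B", "sample_C")
  )
)

# Total reads per sample (sequencing depth)
depth <- colSums(counts)

# Number of OTUs observed (count > 0) in each sample
observed <- colSums(counts > 0)

# Divide each column by its depth to get proportions, so that samples
# sequenced to different depths can be compared
proportions <- sweep(counts, 2, depth, "/")

print(counts)
print(depth)
print(round(proportions, 3))
```

Working with proportions rather than raw counts is only one of several ways to account for uneven sequencing depth; it is used here because it needs nothing beyond base R.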

Here we focus on the statistical analysis and visualizations following OTU calling that include:

  • Reading files into R
  • Manipulating tabular and taxonomic data
  • Heat trees (Foster, Sharpton, and Grünwald 2017), stacked bar charts and related visualizations
  • Alpha and beta diversity
  • Ordination methods
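
To make the alpha and beta diversity items above concrete, here is a small base R sketch that computes the Shannon index for each sample and the Bray-Curtis dissimilarity between samples. The abundance matrix, the helper functions `shannon` and `bray_curtis`, and all names are hypothetical and written only for this illustration; in practice, packages such as vegan provide tested implementations (for example `diversity` and `vegdist`).

```r
# Hypothetical abundance matrix: rows are OTUs, columns are samples.
counts <- matrix(
  c(10,  0,  5,
    10, 20,  5,
     0, 20, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("OTU_1", "OTU_2", "OTU_3"),
    c("sample_A", "sample_B", "sample_C")
  )
)

# Alpha diversity: Shannon index H = -sum(p * log(p)) within one sample,
# where p is the proportion of reads belonging to each observed OTU
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}
alpha <- apply(counts, 2, shannon)

# Beta diversity: Bray-Curtis dissimilarity between two samples,
# sum(|x - y|) / sum(x + y); 0 means identical, 1 means no shared OTUs
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Fill a sample-by-sample dissimilarity matrix
samples <- colnames(counts)
beta <- matrix(0, nrow = length(samples), ncol = length(samples),
               dimnames = list(samples, samples))
for (i in samples) {
  for (j in samples) {
    beta[i, j] <- bray_curtis(counts[, i], counts[, j])
  }
}

print(round(alpha, 3))
print(round(beta, 3))
```

A dissimilarity matrix like `beta` is the usual input to ordination methods, which arrange samples in a few dimensions so that similar communities plot close together.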

Help us improve this resource

We hope you enjoy this primer. Please provide us feedback on any errors you might find or suggestions for improvement.

Citing this primer

Please cite this primer if you find it useful for your research as: ZSL Foster and NJ Grünwald. 2018. Analysis of Microbiome Community Data in R. DOI: XXX.

Niklaus J. Grünwald and Zach S. L. Foster

© 2018, Corvallis, Oregon, USA


Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (7). Nature Publishing Group: 581.

Caporaso, J Gregory, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D Bushman, Elizabeth K Costello, Noah Fierer, et al. 2010. “QIIME Allows Analysis of High-Throughput Community Sequencing Data.” Nature Methods 7 (5). Nature Publishing Group: 335–36.

Comeau, André M, Gavin M Douglas, and Morgan GI Langille. 2017. “Microbiome Helper: A Custom and Streamlined Workflow for Microbiome Research.” mSystems 2 (1). Am Soc Microbiol: e00127–16.

Foster, Zachary SL, Thomas J Sharpton, and Niklaus J Grünwald. 2017. “Metacoder: An R Package for Visualization and Manipulation of Community Taxonomic Diversity Data.” PLoS Computational Biology 13 (2). Public Library of Science: e1005404.

Schloss, Patrick D, Sarah L Westcott, Thomas Ryabin, Justine R Hall, Martin Hartmann, Emily B Hollister, Ryan A Lesniewski, et al. 2009. “Introducing Mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities.” Applied and Environmental Microbiology 75 (23). Am Soc Microbiol: 7537–41.