*** OUT OF DATE ***

We just released a new version of metacoder to CRAN with many changes. We will update this website as soon as possible. In the mean time, look at the intoduction vignette/readme here: https://github.com/grunwaldlab/metacoder

Introduction

Metabarcoding is revolutionizing microbial ecology and presenting new challenges:

  • Numerous database formats make taxonomic data difficult to parse, combine, and subset.
  • Stacked bar charts, commonly used to depict community diversity, lack taxonomic context and are limited by the use of color to represent taxa.
  • Barcode loci and primers are a source of under-explored bias.

MetacodeR is an R package that attempts to addresses these issues:

  • Sources of taxonomic data can be extracted from any file format and manipulated using dpylr-like functions that take into account hierarchical relationships.
  • Community diversity can be visualized by color and size in a type of tree plot we call heat trees.
  • Primer specificity can be estimated with in silico PCR.

Installation

This project is available on CRAN and can be installed like so:

install.packages("metacoder")

You can also intall the development version for the newest features, bugs, and bug fixes:

devtools::install_github(repo="grunwaldlab/metacoder", build_vignettes = TRUE)

If you’ve built the vignettes, you can browse them with:

browseVignettes(package="metacoder")

Dependencies

The function that runs in silico PCR requires primersearch from the EMBOSS (Rice, Longden, and Bleasby 2000) tool kit to be installed. This is not an R package, so it is not automatically installed. Type ?primersearch after installing and loading metacoder for installation instructions.

Extracting taxonomic data

Most databases have a unique file format and taxonomic hierarchy/nomenclature. Taxonomic data can be extracted from any file format using the extract_taxonomy function. Classifications can be parsed offline or retrieved from online databases if a taxon name, taxon ID, or sequence ID is present. A regular expression with capture groups and a corresponding key is used to define how to parse the file. The example code below parses the 16s Ribosome Database Project training set for Mothur which can be found here:

http://mothur.org/w/images/b/b5/Trainset10_082014.rdp.tgz

The file can be parsed using the ape package and the taxonomy data in the headers can be extracted by extract_taxonomy:

library(metacoder)
fasta_path <- file.path("raw_input", "trainset10_082014.rdp.fasta")
seqs <- ape::read.FASTA(fasta_path)
cat(names(seqs)[1])
## AB294171_S001198039  Root;Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;Alkalibacterium
data <- extract_taxonomy(seqs, regex = "^(.*)\\t(.*)",
                         key = c(id = "obs_info", "class"),
                         class_sep = ";")

The resulting object contains sequence information associated with an inferred taxonomic hierarchy. The standard print method of the object shows a part of every kind of data it contains:

data
## `taxmap` object with data for 2794 taxa and 10650 observations:
## 
## ------------------------------- taxa -------------------------------
## 1, 2, 3, 4, 5, 6 ... 2788, 2789, 2790, 2791, 2792, 2793, 2794
## 
## ---------------------------- taxon_data ----------------------------
## # A tibble: 2,794 × 3
##   taxon_ids supertaxon_ids            name
##       <chr>          <chr>           <chr>
## 1         1           <NA>            Root
## 2         2              1         Archaea
## 3         3              1        Bacteria
## 4         4              2 "Crenarchaeota"
## 5         5              2 "Euryarchaeota"
## 6         6              2  "Korarchaeota"
## 7         7              2 "Nanoarchaeota"
## # ... with 2,787 more rows
## 
## ----------------------------- obs_data -----------------------------
## # A tibble: 10,650 × 3
##   obs_taxon_ids                  id
##           <chr>               <chr>
## 1          1289 AB294171_S001198039
## 2           369 AB243007_S000622964
## 3          2402 AJ717394_S000623308
## 4          1409 AJ518871_S000351779
## 5           962 CP000159_S000632736
## 6          1311   X76329_S000015720
## 7           616 AB184711_S000652435
## # ... with 1.064e+04 more rows, and 1 more variables: sequence <chr>
## 
## --------------------------- taxon_funcs ---------------------------
## n_obs, n_obs_1, n_supertaxa, n_subtaxa, n_subtaxa_1, hierarchies

Note the taxon_funcs part contains function names (e.g. n_obs) used to calculate taxon statistics. You can see the results of these calculations using the taxon_data function:

taxon_data(data)
## # A tibble: 2,794 × 9
##    taxon_ids supertaxon_ids              name n_obs n_obs_1 n_supertaxa
##        <chr>          <chr>             <chr> <dbl>   <dbl>       <dbl>
## 1          1           <NA>              Root 10650       0           0
## 2          2              1           Archaea   410       0           1
## 3          3              1          Bacteria 10240       0           1
## 4          4              2   "Crenarchaeota"    55       0           2
## 5          5              2   "Euryarchaeota"   335       0           2
## 6          6              2    "Korarchaeota"    10      10           2
## 7          7              2   "Nanoarchaeota"     3       3           2
## 8          8              2 Nanohaloarchaeota     1       0           2
## 9          9              2  "Thaumarchaeota"     6       0           2
## 10        10              4      Thermoprotei    55       0           3
## # ... with 2,784 more rows, and 3 more variables: n_subtaxa <dbl>,
## #   n_subtaxa_1 <dbl>, hierarchies <chr>

There is corresponding obs_data function and obs_funcs list, but they are not used in this case.

Heat trees

The hierarchical nature of taxonomic data makes it difficult to plot effectively. Most often, bar charts, stacked bar charts, or pie graphs are used, but these are ineffective when plotting many taxa or multiple ranks. MetacodeR maps taxonomic data (e.g. sequence abundance) to color/size of tree components in what we call a heat tree:

heat_tree(data, node_size = n_obs, node_label = name, node_color = n_obs)

The default size range displayed is optimized for each plot. The legend represents the number of sequences for each taxon as both a color gradient and width of nodes. Only a few options are needed to make effective plots, yet many are available for customization of publication-ready graphics:

set.seed(8)
heat_tree(data, node_size = n_obs, edge_color = n_supertaxa,
          node_label = name, node_color = n_obs,
          node_color_range = c("cyan", "magenta", "green"),
          edge_color_range   = c("#555555", "#EEEEEE"),
          initial_layout = "reingold-tilford", layout = "davidson-harel",
          overlap_avoidance = 0.5)

The above command can take several minutes since it uses a force-directed layout that requires simulations.

To see the long list of available plotting options, type ?heat_tree.

Subsetting

Taxonomic data in the form used in metacoder can be manipulated using functions inspired by dplyr. For example, taxa can be subset using filter_taxa. Unlike the dplyr function filter, users can choose preserve or remove the subtaxa, supertaxa, and associated observation data of the selected taxa. For example, filter_taxa can be used to look at just the Archaea:

set.seed(1)
heat_tree(filter_taxa(data, name == "Archaea", subtaxa = TRUE),
          node_size = n_obs, node_label = name, 
          node_color = n_obs, layout = "fruchterman-reingold")

Any column displayed by taxon_data can be used with filter_taxa (and most other metacodeR commands) as if it were a variable on its own. To make the Archaea-Bacteria division more clear, the “Root” taxon can be removed, resulting in two separate trees:

subsetted <- filter_taxa(data, n_supertaxa > 0)
set.seed(2)
heat_tree(subsetted, node_size = n_obs, node_label = name,
          node_color = n_obs, tree_label = name)

Although observations (information in obs_data) are typically assinged to the tips of the taxonomy, they can also be assigned to any internal node. When a taxon is removed by a filtering operation, the observations assinged to it are reassigned to an unfiltered supertaxon by default. This makes it easy to remove lower ranks of a taxonomy without discarding oberservations assinged to the tips:

set.seed(1)
filter_taxa(data, n_supertaxa < 4) %>%
  heat_tree(node_size = n_obs, node_label = name, node_color = n_obs)

There is also a filter_obs function which can filter out observations and the taxa they are assigned to.

Sampling

When calculating statistics for taxa, the amount of data should be balanced across taxa and there should be enough data per taxon to provide unbiased estimates. Random samples from large reference databases are biased towards overrepresented taxa. Metacoder provides two ways to randomly sample taxonomic data. The function taxonomic_sample is used to create taxonomically balanced random samples. The acceptable range of sequence or subtaxa counts can be defined for each taxonomic rank; taxa with too few are excluded and taxa with too many are randomly subsampled. The code below samples the data such that rank 6 taxa will have 5 sequences and rank 3 taxa (phyla) will have less than 100:

set.seed(1)
sampled <- taxonomic_sample(subsetted, max_counts = c("3" = 100, "6" = 5), min_counts = c("6" = 5))
sampled <- filter_taxa(sampled, n_obs > 0, subtaxa = FALSE) 
set.seed(3)
heat_tree(sampled, node_size = n_obs, node_label = n_obs, overlap_avoidance = 0.5,
          node_color = n_obs, tree_label = name)

This can also be accomplished using the dplyr equivalents to sample_n and sample_frac by weighting the probability of sampling observations by the inverse of the number of observations:

set.seed(6)
sample_n_obs(subsetted, size = 1000, taxon_weight = 1 / n_obs, unobserved = FALSE) %>%
  heat_tree(node_size = n_obs, node_label = n_obs, overlap_avoidance = 0.5,
            node_color = n_obs, tree_label = name)

In silico PCR

The function primersearch is a wrapper for an EMBOSS tool that implements in silico PCR. The code below estimates the coverage of the universal prokaryotic primer pair 515F/806R (Walters et al. 2016):

pcr <- primersearch(sampled, 
                    forward = c("515F" = "GTGYCAGCMGCCGCGGTAA"), 
                    reverse = c("806R" = "GGACTACNVGGGTWTCTAAT"), 
                    mismatch = 10)
taxon_data(pcr)
## # A tibble: 1,999 × 11
##    taxon_ids supertaxon_ids              name n_obs n_obs_1 n_supertaxa
##        <chr>          <chr>             <chr> <dbl>   <dbl>       <dbl>
## 1          2           <NA>           Archaea   362       0           0
## 2          3           <NA>          Bacteria  4569       0           0
## 3          4              2   "Crenarchaeota"    55       0           1
## 4          5              2   "Euryarchaeota"   287       0           1
## 5          6              2    "Korarchaeota"    10      10           1
## 6          7              2   "Nanoarchaeota"     3       3           1
## 7          8              2 Nanohaloarchaeota     1       0           1
## 8          9              2  "Thaumarchaeota"     6       0           1
## 9         10              4      Thermoprotei    55       0           2
## 10        11             10      Acidilobales     3       0           3
## # ... with 1,989 more rows, and 5 more variables: n_subtaxa <dbl>,
## #   n_subtaxa_1 <dbl>, hierarchies <chr>, count_amplified <dbl>,
## #   prop_amplified <dbl>

The proportion of sequences amplified can be represented by color in a heat tree:

set.seed(3)
heat_tree(pcr, node_size = n_obs, node_label = name,
          node_color = prop_amplified,
          node_color_range =  c("red", "orange", "yellow", "green", "cyan"),
          node_color_trans = "linear", tree_label = name)

This plot makes it apparent that most taxa were amplified, but not all. The data can also be subset to better see what did not get amplified:

set.seed(1) 
pcr %>%
  filter_taxa(count_amplified < n_obs) %>% 
  heat_tree(node_size = n_obs, node_label = name, node_color = prop_amplified,
            node_color_range =  c("red", "orange", "yellow", "green", "cyan"),
            initial_layout = "reingold-tilford", layout = "davidson-harel",
            node_color_interval = c(0, 1), node_color_trans = "linear",
            node_label_max = 500, tree_label = name)

Differential heat trees

We can use what we call differential heat trees to compare the values of treatments such as:

  • the relative abundance of taxa in two communities
  • the coverages of different primer pairs

Here, we compare the effectiveness 515F/806R to another primer pair 357F/519F, by plotting the difference in proportions amplified by each. First, the same sequences are amplified with 357F/519F and results for the two primer pairs combined:

pcr_2 <- primersearch(sampled,
                      forward = c("357F" = "CTCCTACGGGAGGCAGCAG"), 
                      reverse = c("519F" = "GWATTACCGCGGCKGCTG"), 
                      mismatch = 10)
pcr <- mutate_taxa(pcr, 
                   count_amplified_2 = taxon_data(pcr_2, col_subset = "count_amplified", drop = TRUE),
                   prop_diff = prop_amplified - taxon_data(pcr_2, col_subset = "prop_amplified", drop = TRUE))

Then, taxa that are not amplified by both pairs can be subset and the difference in amplification plotted. In the plot below, green corresponds to taxa amplified by 515F/806R but not 357F/519F and brown is the opposite:

set.seed(2)
pcr %>%
  filter_taxa(abs(prop_diff) > 0.1, supertaxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_label = ifelse(abs(prop_diff) > 0.3, name, NA),
            node_color = prop_diff,
            node_color_range = diverging_palette(),
            node_color_interval = c(-1, 1),
            node_color_trans = "linear",
            node_label_max = 500,
            tree_label = name,
            node_color_axis_label = "Difference in proportion amplified",
            node_size_axis_label = "Sequence count",
            initial_layout = "reingold-tilford", layout = "davidson-harel",
            title = "Comparison of two universal primer pairs using in silico PCR",
            title_size = 0.03)

Future development

MetacodeR is under active development and many new features are planned. Some improvements that are being explored include:

  • Plotting functions for pairwise comparison of treatments (see figure 5 in the publication).
  • Barcoding gap analysis and associated plotting functions
  • A function to aid in retrieving appropriate sequence data from NCBI for in silico PCR from whole genome sequences.
  • Graphing of different node shapes in heat trees, possibly including pie graphs or other graphs.

To see the details of what is being worked on, check out the issues tab of the MetacodeR Github site.

Aknowledgements

We thank Thomas Sharpton for sharing his metagenomics expertise and advising us. metacoders’s major dependencies are taxize, igraph, and ggplot2.

Feedback and contributions

We would like to hear about users’ thoughts on the package and any errors they run into. Please report errors, questions or suggestions on the issues tab of the MetacodeR Github site. We also welcome contributions via a Github pull request. You can also talk with us using our Google groups site or the comments section below.

Software and packages used

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] metacoder_0.1.2     knitcitations_1.0.7 knitr_1.15.1       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10      plyr_1.8.4        bitops_1.0-6      tools_3.3.3      
##  [5] digest_0.6.12     gtable_0.2.0      lubridate_1.6.0   evaluate_0.10    
##  [9] tibble_1.3.0      nlme_3.1-131      lattice_0.20-34   bibtex_0.4.0     
## [13] igraph_1.0.1      DBI_0.6-1         yaml_2.1.14       parallel_3.3.3   
## [17] RefManageR_0.13.1 dplyr_0.5.0       httr_1.2.1        stringr_1.2.0    
## [21] rprojroot_1.2     R6_2.2.0          XML_3.98-1.6      rmarkdown_1.4    
## [25] RJSONIO_1.3-0     reshape2_1.4.2    ggplot2_2.2.1     magrittr_1.5     
## [29] backports_1.0.5   scales_0.4.1      htmltools_0.3.5   assertthat_0.1   
## [33] ape_4.1           colorspace_1.3-2  labeling_0.3      stringi_1.1.3    
## [37] RCurl_1.95-4.8    lazyeval_0.2.0    munsell_0.4.3

References

Rice, Peter, Ian Longden, and Alan Bleasby. 2000. “EMBOSS: The European Molecular Biology Open Software Suite.” Elsevier Current Trends.

Walters, William, Embriette R Hyde, Donna Berg-Lyons, Gail Ackermann, Greg Humphrey, Alma Parada, Jack A Gilbert, et al. 2016. “Improved Bacterial 16S RRNA Gene (V4 and V4-5) and Fungal Internal Transcribed Spacer Marker Gene Primers for Microbial Community Surveys.” MSystems 1 (1). American Society for Microbiology Journals: e00009–15.

Comments