Manipulating taxonomic data

Tabular data is relatively straight forward to manipulate if you know some R already, but taxonomic data is hierarchical, making it much harder to work with. It gets even more complicated when there are other data associated with the taxonomy.

For example, if we want to remove a bacterial phylum from the data, do we also remove all of the taxa within that phylum? If so, how do we identify which taxa are in that phylum? Depending on the format the data is in, this might be hard. If other data (e.g. OTU counts) are associated with taxa within that phylum, do we also remove those too? After all, even if the phylum and the taxa within it are removed, we still know that all the associated data are bacterial, so do we reassign them to the Bacteria taxon, or just throw those data out? The answer to these kind of questions will depend on our data and what we want to do with it.

Load example data

If you are starting the workshop at this section, or had problems running code in a previous section, use the following to load the data used in this section. You can download the “parsed_data.Rdata” file here. If you have just done the previous section, you can ignore this and proceed.

load("parsed_data.Rdata")

Subsetting tabular data

The functions to subset taxonomic information in taxmap objects (the format our example data is in) provided by the metacoder package are modeled after functions in the dplyr package for manipulating tabular data (Wickham and Francois (2015)). These dplyr functions are alternatives to the standard R subsetting (e.g. [, [[, and $) designed to be be more intuitive and consistent. Since an understanding of these functions will help with the analogous (and more complicated) functions for taxonomic data, we will start by practicing these. Fortunately, we have a tabular data set in need of some subsetting: our sample data.

First, lets take another look at our sample data:

print(sample_data)

## # A tibble: 1,698 × 16
##    SampleID   Name      Plant…¹ Type  Exper…² Cohort Harve…³ Age   Site  Treat…⁴ Line  Genot…⁵ Block
##    <chr>      <chr>     <chr>   <chr> <chr>   <chr>  <chr>   <chr> <chr> <chr>   <chr> <chr>   <chr>
##  1 M1024P1788 R_026.ro… R_026   root  fieldB… 2008   2011    3     LTM   field   26    ril     NA   
##  2 M1024P1791 R_073.ro… R_073   root  fieldB… 2008   2011    3     LTM   field   73    ril     NA   
##  3 M1024P1795 R_088.ro… R_088   root  fieldB… 2009   2011    2     LTM   field   88    ril     NA   
##  4 M1024P1807 R_156.ro… R_156   root  fieldB… 2009   2011    2     LTM   field   156   ril     NA   
##  5 M1955P804  1_A_1.ro… 1_A_1   root  ecoGH   NA     2011    NA    Duke  MAHsoil par1… PAR     b01  
##  6 M1956P837  1_A_12.r… 1_A_12  root  ecoGH   NA     2011    NA    Duke  MAHsoil mah3… MAH     b01  
##  7 M1957P983  1_A_3.ro… 1_A_3   root  ecoGH   NA     2011    NA    Duke  MAHsoil silL… SIL     b01  
##  8 M1956P845  1_A_4.ro… 1_A_4   root  ecoGH   NA     2011    NA    Duke  MAHsoil par1… PAR     b01  
##  9 M1957P987  1_A_7.ro… 1_A_7   root  ecoGH   NA     2011    NA    Duke  MAHsoil par9… PAR     b01  
## 10 M1957P923  1_A_8.ro… 1_A_8   root  ecoGH   NA     2011    NA    Duke  MAHsoil mil5… MIL     b01  
## # … with 1,688 more rows, 3 more variables: oldPlate <chr>, newPlate <chr>, Analysis <chr>, and
## #   abbreviated variable names ¹Plant_ID, ²Experiment, ³Harvested, ⁴Treatment, ⁵Genotype

This is a large data set and not all the 1698 samples are used in the main publication of Wagner et al. (2016). Only the samples in the “ecotypes” experiment are part of the main study. Using standard R subsetting we could get just those rows like so:

sample_data <- sample_data[sample_data$Experiment == "ecotypes", ]

Using the dplyr functions, the same operation would be:

library(dplyr)
sample_data <- filter(sample_data, Experiment == "ecotypes")

Note how we could use the Experiment column on its own within the filter functions. That is a special property of the filter function and many others in dplyr known as Non-standard evaluation (NSE). If we tried that outside of the function, we would get an error. For example, the following will not work:

is_ecotype <- Experiment == "ecotypes"

## Error in eval(expr, envir, enclos): object 'Experiment' not found

Most dplyr functions work this way and so do the analogous functions in metacoder for manipulating taxonomic data. If we have multiple filtering criteria, we can simply add more arguments to the functions. In the following code, we subset the table to only rows for plants from one of three site and that are 3 years old.

sample_data <- filter(sample_data,
                      Site %in% c("Mah", "Jam", "Sil"),
                      Age == 3)

This is usually equivalent to combining multiple conditions with & in normal subsetting, but is faster for large data sets. It also will not return values for which conditions evaluate to NA, unlike base R subsetting.

Up until now, everything we have done using filter can be easily done with base R subsetting as well, but dplyr introduces a few more advanced features that can save time/typing. We wont go into these in detail, but its good to know they exist. One is the concept of “grouping”. When rows are “grouped” by the values in one column, the rows corresponding to each unique value in that column are treated as a unit in some functions. n() is one such function that counts the number of items in each group. We can use this to filter out any plants that don’t have both root and leaf samples by counting the number of times each “Plant_ID” shows up:

sample_data <- sample_data %>%
  group_by(Plant_ID) %>%
  filter(n() == 2)

This leaves us with 292 samples. Note the use of the %>% “piping” operator. This takes the result of the function that comes before and uses it as the first argument to function that comes after. For example, seq(1:10) %>% length() is the same as length(seq(1:10)). This allows for chaining many commands together in a readable way. Next we will subset the abundance data to just these samples to get the dataset down to a manageable size for a standard personal computer.

Subsetting a taxonomy

The analog to filter for taxonomic data is filter_taxa from the metacoder package. This works in a similar way, but has many more options for how taxonomic relationships are used and what is done with data associated with removed taxa. Lets take another look at our data object:

print(obj)

## <Taxmap>
##   1558 taxa: aab. Unassigned, aac. Root ... chx. Methanospirillaceae, chy. 
##   1558 edges: NA->aab, NA->aac, aac->aad, aac->aae ... bel->chw, ays->chx, bem->chy
##   1 data sets:
##     otu_counts:
##       # A tibble: 47,806 × 1,707
##         taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
##         <chr>    <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aab      1               0       0       0       0       0       0       0       0
##       2 ben      2              41      22       4     726  112492       2     413       2
##       3 beo      3              67      65      12   13514       1       4   13314      70
##       # … with 47,803 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
##       #   M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
##       #   M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
##       #   names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
##       #   ⁷M1551P90
##   0 functions:

If we look at the second line, we can see that some taxa do not have names (the taxon with the ID “chy”). These are the ones that only had rank information in the classification we parsed in the last section (e.g. o__ in "Root;k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__"). Since these do not add any information, lets remove them.

library(metacoder)
obj <- filter_taxa(obj, taxon_names != "")
print(obj)

## <Taxmap>
##   959 taxa: aab. Unassigned, aac. Root ... chs. Alcanivoracaceae, chx. Methanospirillaceae
##   959 edges: NA->aab, NA->aac, aac->aad, aac->aae ... anq->cho, apt->chs, ays->chx
##   1 data sets:
##     otu_counts:
##       # A tibble: 47,806 × 1,707
##         taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
##         <chr>    <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aab      1               0       0       0       0       0       0       0       0
##       2 ben      2              41      22       4     726  112492       2     413       2
##       3 beo      3              67      65      12   13514       1       4   13314      70
##       # … with 47,803 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
##       #   M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
##       #   M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
##       #   names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
##       #   ⁷M1551P90
##   0 functions:

There were OTUs associated with the nameless taxa, but no OTUs were filtered out, so what happened to them? By default, they are reassigned to the closest supertaxon that passes the filter. For example, for the classification "Root;k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__", the OTU would be reassigned to the class “SJA-28” and the taxon ID for that OTU would be changed to the taxon ID for “SJA-28”.

Like filter, filter_taxa can use variables contained in the input objects as if they were independent variables. In this case, taxon_names is actually a function that is called when referenced in this way.

head(taxon_names(obj))

##              aab              aac              aad              aae              aaf 
##     "Unassigned"           "Root"       "Bacteria"        "Archaea" "Proteobacteria" 
##              aag 
## "Actinobacteria"

Other values that can be used this way include columns in tables and lists/vectors stored in the obj$data list in taxmap objects. Since taxmap objects can store any number of lists, vectors, or tables mapped to a taxonomy, there can be many variables that can be referenced by name. The all_names function returns the name and location of variables that can be used this way:

head(all_names(obj), 20)

##                          taxon_names                            taxon_ids 
##                        "taxon_names"                          "taxon_ids" 
##                        taxon_indexes                      classifications 
##                      "taxon_indexes"                    "classifications" 
##                          n_supertaxa                        n_supertaxa_1 
##                        "n_supertaxa"                      "n_supertaxa_1" 
##                            n_subtaxa                          n_subtaxa_1 
##                          "n_subtaxa"                        "n_subtaxa_1" 
##                             n_leaves                           n_leaves_1 
##                           "n_leaves"                         "n_leaves_1" 
##                          taxon_ranks                              is_root 
##                        "taxon_ranks"                            "is_root" 
##                              is_stem                            is_branch 
##                            "is_stem"                          "is_branch" 
##                              is_leaf                         is_internode 
##                            "is_leaf"                       "is_internode" 
##                                n_obs                              n_obs_1 
##                              "n_obs"                            "n_obs_1" 
##     data[['otu_counts']][['OTU_ID']] data[['otu_counts']][['M1024P1833']] 
##                             "OTU_ID"                         "M1024P1833"

length(all_names(obj))

## [1] 1724

Next, lets also filter out anything not in Bacteria, since the focus of the study was Bacteria.

obj <- filter_taxa(obj, taxon_names == "Bacteria", subtaxa = TRUE)
print(obj)

## <Taxmap>
##   898 taxa: aad. Bacteria, aaf. Proteobacteria ... cho. wb1_P06, chs. Alcanivoracaceae
##   898 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... bea->chi, anq->cho, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 33,641 × 1,707
##         taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
##         <chr>    <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              41      22       4     726  112492       2     413       2
##       2 beo      3              67      65      12   13514       1       4   13314      70
##       3 ben      4            3229   13679    1832     951     113    2496     567    2428
##       # … with 33,638 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
##       #   M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
##       #   M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
##       #   names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
##       #   ⁷M1551P90
##   0 functions:

Note the use of the subtaxa = TRUE; this means that all the subtaxa of taxa that pass the filter should also be preserved. If we left this option off, we would would have only a single taxon with all OTUs assigned to it:

filter_taxa(obj, taxon_names == "Bacteria")

## <Taxmap>
##   1 taxa: aad. Bacteria
##   1 edges: NA->aad
##   1 data sets:
##     otu_counts:
##       # A tibble: 33,641 × 1,707
##         taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
##         <chr>    <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aad      2              41      22       4     726  112492       2     413       2
##       2 aad      3              67      65      12   13514       1       4   13314      70
##       3 aad      4            3229   13679    1832     951     113    2496     567    2428
##       # … with 33,638 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
##       #   M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
##       #   M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
##       #   names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
##       #   ⁷M1551P90
##   0 functions:

Note how the taxon IDs for OTUs change when taxa are filtered. Next, lets remove the columns from the abundance matrix that were removed from the sample data at the start of the section.

obj$data$otu_counts <- obj$data$otu_counts[c("taxon_id", "OTU_ID", sample_data$SampleID)]

This leaves us with a data set of 33,641 OTUs in 292 samples, classified by 898 taxa:

print(obj)

## <Taxmap>
##   898 taxa: aad. Bacteria, aaf. Proteobacteria ... cho. wb1_P06, chs. Alcanivoracaceae
##   898 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... bea->chi, anq->cho, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 33,641 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 33,638 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

Subsetting data classified by a taxonomy

In the previous section we subset the taxonomy in a taxmap object using filter_taxa. By default, when taxa are removed, the OTUs assigned to them were reassigned to a supertaxon or removed as well. We can also subset the OTU table directly and remove any taxa that are no longer needed using the function filter_obs. The “obs” in the function name stands for “observations”. Unlike filter, which you used previously to subset tables, filter_obs works on a data set in a taxmap object, like our OTU table.

In the last section we removed samples (i.e. columns in the OTU table) that did not appear in the Wagner et al. (2016) publication. This probably means that some of the OTUs only existed in those samples and now have no reads in the remaining samples. Lets remove those to make the data set more manageable. First we need to find which OTUs (i.e. rows) no have no reads. We can do that by using the function rowSums to get the number of reads in all samples for each OTU.

has_no_reads <- rowSums(obj$data$otu_counts[, sample_data$SampleID]) == 0

Since the table otu_counts also has the taxon ID and OTU ID columns we need to subset the columns to just the samples before using rowSums. The command above results in a logical vector, which is a series of TRUE/FALSE values that can be used to subset things. We can use sum to count the number of TRUE values and hence the number of OTUs without reads.

sum(has_no_reads)

## [1] 11957

There are 11,957 of 33,641 OTUs that now have no reads. Now that we know which they are, we can remove them using filter_obs.

filter_obs(obj, "otu_counts", ! has_no_reads) # note the ! negation operator

## <Taxmap>
##   898 taxa: aad. Bacteria, aaf. Proteobacteria ... cho. wb1_P06, chs. Alcanivoracaceae
##   898 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... bea->chi, anq->cho, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

Note how this removed rows in the “otu_counts” table but did not remove any taxa. To remove taxa that are no longer observed in any OTUs, we need to add the drop_taxa option:

obj <- filter_obs(obj, "otu_counts", ! has_no_reads, drop_taxa = TRUE)
print(obj)

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

If there were other tables in obj$data the data for those taxa would have been removed as well unless drop_obs = FALSE was added. Since there is only the one table, we don’t need to worry about that.

Exercises

In these exercises, we will be using obj and sample_data from the analysis above. If you did not run the code above or had problems, run the following code to get the objects used. You can download the “filtered_data.Rdata” file here.

load("filtered_data.Rdata")

Subsetting tabular data

1) Try to subset sample_data to just “leaf” samples using standard R subsetting (e.g. my_table[rows, columns]).

sample_data[sample_data$Type == "leaf", ]

## # A tibble: 146 × 16
## # Groups:   Plant_ID [146]
##    SampleID  Name       Plant…¹ Type  Exper…² Cohort Harve…³ Age   Site  Treat…⁴ Line  Genot…⁵ Block
##    <chr>     <chr>      <chr>   <chr> <chr>   <chr>  <chr>   <chr> <chr> <chr>   <chr> <chr>   <chr>
##  1 M1981P563 8_0053.le… 8_0053  leaf  ecotyp… 2008   2011    3     Mah   field   MAH.… MAH     8_02 
##  2 M1980P502 8_0074.le… 8_0074  leaf  ecotyp… 2008   2011    3     Mah   field   JAM.… JAM     8_02 
##  3 M1981P599 8_0087.le… 8_0087  leaf  ecotyp… 2008   2011    3     Mah   field   MIL.… MIL     8_02 
##  4 M1981P587 8_0090.le… 8_0090  leaf  ecotyp… 2008   2011    3     Mah   field   SILL… SIL     8_02 
##  5 M1979P447 8_0096.le… 8_0096  leaf  ecotyp… 2008   2011    3     Mah   field   PAR.… PAR     8_02 
##  6 M1982P661 8_0149.le… 8_0149  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_04 
##  7 M1980P462 8_0157.le… 8_0157  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_04 
##  8 M1980P471 8_0158.le… 8_0158  leaf  ecotyp… 2008   2011    3     Jam   field   JAM.… JAM     8_04 
##  9 M1982P655 8_0167.le… 8_0167  leaf  ecotyp… 2008   2011    3     Jam   field   PAR.… PAR     8_04 
## 10 M1981P623 8_0174.le… 8_0174  leaf  ecotyp… 2008   2011    3     Jam   field   MAH.… MAH     8_04 
## # … with 136 more rows, 3 more variables: oldPlate <chr>, newPlate <chr>, Analysis <chr>, and
## #   abbreviated variable names ¹Plant_ID, ²Experiment, ³Harvested, ⁴Treatment, ⁵Genotype

2) Try to do the same thing using the filter function from the dplyr package.

filter(sample_data, Type == "leaf")

## # A tibble: 146 × 16
## # Groups:   Plant_ID [146]
##    SampleID  Name       Plant…¹ Type  Exper…² Cohort Harve…³ Age   Site  Treat…⁴ Line  Genot…⁵ Block
##    <chr>     <chr>      <chr>   <chr> <chr>   <chr>  <chr>   <chr> <chr> <chr>   <chr> <chr>   <chr>
##  1 M1981P563 8_0053.le… 8_0053  leaf  ecotyp… 2008   2011    3     Mah   field   MAH.… MAH     8_02 
##  2 M1980P502 8_0074.le… 8_0074  leaf  ecotyp… 2008   2011    3     Mah   field   JAM.… JAM     8_02 
##  3 M1981P599 8_0087.le… 8_0087  leaf  ecotyp… 2008   2011    3     Mah   field   MIL.… MIL     8_02 
##  4 M1981P587 8_0090.le… 8_0090  leaf  ecotyp… 2008   2011    3     Mah   field   SILL… SIL     8_02 
##  5 M1979P447 8_0096.le… 8_0096  leaf  ecotyp… 2008   2011    3     Mah   field   PAR.… PAR     8_02 
##  6 M1982P661 8_0149.le… 8_0149  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_04 
##  7 M1980P462 8_0157.le… 8_0157  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_04 
##  8 M1980P471 8_0158.le… 8_0158  leaf  ecotyp… 2008   2011    3     Jam   field   JAM.… JAM     8_04 
##  9 M1982P655 8_0167.le… 8_0167  leaf  ecotyp… 2008   2011    3     Jam   field   PAR.… PAR     8_04 
## 10 M1981P623 8_0174.le… 8_0174  leaf  ecotyp… 2008   2011    3     Jam   field   MAH.… MAH     8_04 
## # … with 136 more rows, 3 more variables: oldPlate <chr>, newPlate <chr>, Analysis <chr>, and
## #   abbreviated variable names ¹Plant_ID, ²Experiment, ³Harvested, ⁴Treatment, ⁵Genotype

3) Using a single call to the filter function, get all leaf samples from site “Jam” that are of genotype “SIL” or “MIL”.

filter(sample_data,
       Type == "leaf",
       Site == "Jam",
       Genotype %in% c("SIL", "MIL"))

## # A tibble: 14 × 16
## # Groups:   Plant_ID [14]
##    SampleID  Name       Plant…¹ Type  Exper…² Cohort Harve…³ Age   Site  Treat…⁴ Line  Genot…⁵ Block
##    <chr>     <chr>      <chr>   <chr> <chr>   <chr>  <chr>   <chr> <chr> <chr>   <chr> <chr>   <chr>
##  1 M1982P661 8_0149.le… 8_0149  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_04 
##  2 M1980P462 8_0157.le… 8_0157  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_04 
##  3 M1982P711 8_0326.le… 8_0326  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_07 
##  4 M1979P439 8_0609.le… 8_0609  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_13 
##  5 M1981P586 8_0612.le… 8_0612  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_13 
##  6 M1980P481 8_0775.le… 8_0775  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_17 
##  7 M1982P664 8_0803.le… 8_0803  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_17 
##  8 M1981P582 8_1023.le… 8_1023  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_22 
##  9 M1982P663 8_1048.le… 8_1048  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_22 
## 10 M1981P639 8_1382.le… 8_1382  leaf  ecotyp… 2008   2011    3     Jam   field   MIL.… MIL     8_29 
## 11 M1981P626 8_1388.le… 8_1388  leaf  ecotyp… 2008   2011    3     Jam   field   SILL… SIL     8_29 
## 12 M1981P622 9_0171.le… 9_0171  leaf  ecotyp… 2009   2012    3     Jam   field   SILL… SIL     9_04 
## 13 M1982P718 9_0799.le… 9_0799  leaf  ecotyp… 2009   2012    3     Jam   field   SILL… SIL     9_17 
## 14 M1554P181 9_1364.le… 9_1364  leaf  ecotyp… 2009   2012    3     Jam   field   MIL.… MIL     9_29 
## # … with 3 more variables: oldPlate <chr>, newPlate <chr>, Analysis <chr>, and abbreviated variable
## #   names ¹Plant_ID, ²Experiment, ³Harvested, ⁴Treatment, ⁵Genotype

Subsetting a taxonomy

For the questions below, do not overwrite the obj with result of filtering (i.e. don’t assign the result to obj with <-).

4) The function n_obs counts the number of items in the first dataset (the OTU table in our case) assigned to each taxon. Try running n_obs(obj) to see how many OTUs are within each taxon.

n_obs(obj)

##   aad   aaf   aag   aah   aai   aaj   aak   aal   aam   aan   aao   aap   aaq   aar   aas   aau 
## 21684  7771  1781   443  1721   501   867  1150   417  1110  2367   106   382    68   719    20 
##   aav   aaw   aax   aay   aaz   aba   abb   abc   abd   abe   abf   abg   abi   abj   abk   abl 
##     8    17   426   448   158    66    42   163    23    11    32    61   444    86    56     6 
##   abm   abn   abo   abq   abr   abs   abt   abw   abx   aby   abz   acc   acg   aci   acr   act 
##     9    86     2    47     1    10    12     1     6     1     1     4     4     1     1     1 
##   acu   acx   add   ade   adf   adg   adh   adi   adj   adk   adl   adm   adn   ado   adp   adq 
##     1     3  2383   706   233  2015   327   535  2612   430   677    17   349   573   308   188 
##   adr   ads   adt   adu   adv   adw   adx   ady   adz   aea   aeb   aec   aed   aee   aeg   aeh 
##   287    44    20   292   149   177    84   382   227  1666   149   135   254   208   207    92 
##   aei   aej   aek   ael   aem   aen   aeo   aep   aeq   aer   aes   aet   aev   aex   aey   aez 
##    56     6    78    67   324    26    94    83    77   719   105     9    14    56    32   504 
##   afa   afb   afc   afe   aff   afg   afh   afi   afj   afk   afl   afm   afo   afp   afq   afr 
##    19   112    33     5   384   137     8    66     9   158    12    17    37     9    12   150 
##   afs   aft   afu   afv   afw   afx   afy   aga   agb   agc   agd   age   agg   agi   agk   agl 
##    95    30   193    54    56    21    23    81    32    61     1    11    55    14   286    16 
##   agm   agn   ago   agp   agq   agr   ags   agt   agu   agw   agx   agy   agz   ahb   ahc   ahd 
##    41   227     6    39    65    82     5    56    24     3     8     8     4    15    21     9 
##   ahe   ahf   ahg   ahh   ahk   ahn   aho   ahp   ahr   ahs   aht   ahu   ahw   ahx   ahy   aib 
##    14    72     2     3     1    17    11    13     9    30    40     4    16     2    48     7 
##   aie   aif   aig   aih   aii   aij   aim   ain   aip   aiq   air   aiu   aiw   aix   aiy   aja 
##     8    20     8     1     7     6     2    15     5     5    13     2     3     5    20     6 
##   ajb   ajc   ajg   ajm   ajn   ajo   ajq   ajw   ajz   akb   akg   akl   akw   akx   ame   amg 
##     1     2    14     4     1     4     2     3     2     4     1     1     1     1     1   482 
##   amh   ami   amj   amk   aml   amm   amn   amo   amp   amq   amr   ams   amt   amu   amv   amw 
##   181   482   694    78    44   327   232  1205   430   677   561    25   267    23    18    54 
##   amx   amy   amz   ana   anb   anc   and   ane   anf   ang   ani   anj   ank   anl   anm   ann 
##    11    86    27   310   345    75   308   188   168    52    19    97    16   177    71    64 
##   ano   anq   anr   ans   ant   anu   anv   anw   anx   any   aoa   aod   aoe   aof   aog   aoh 
##  1467   382   194   250   139    30    33   252   137   227     8    33    69    26    10    56 
##   aoi   aoj   aok   aom   aon   aoo   aoq   aor   aos   aot   aou   aov   aow   aox   aoy   apa 
##     6  1284    30     6    26   323    94   399    59    83    57    42   711    44     9    14 
##   apc   apd   ape   apg   aph   api   apj   apk   apl   apm   apn   app   apq   apr   aps   apt 
##    37    32   248   109     3   283    67    37    48   369    19    69    45    33    17    25 
##   apv   apw   apx   apy   aqa   aqb   aqf   aqh   aqi   aqj   aqk   aql   aqm   aqn   aqo   aqr 
##    23    23    20    40     5     8    24    20     5   153    13    67    18   104    12    78 
##   aqt   aqu   aqx   aqy   aqz   ara   arb   ard   are   arf   arg   arh   ari   arj   ark   arm 
##    12    34    23    37    31    78    78    20     4    46   141    54     9     3    12    40 
##   aro   arp   arq   ars   arw   arx   asa   asb   ase   asf   asg   asj   asl   asn   asp   asr 
##     2   147    21    47     3    32    15    18    11    33    14    25    42     9    14     9 
##   ast   asu   asy   atc   atd   ate   atf   atg   atj   atl   atn   atp   atq   atr   att   atv 
##     5    26     9     7    54    19     2     7     3    10     8     2     1    16    31     5 
##   atx   aty   aua   aue   auf   aug   auh   aui   auj   aul   auo   auq   aus   aut   auu   auw 
##     9    52     3    52     2     9     3     1     7    13     2     1     2     3    11     5 
##   aux   ava   avb   avc   ave   avi   avl   avm   avo   avq   avs   avv   avx   avz   awb   awe 
##     1     4    30    61     9     4     5     8     5     5     8     2     1     9    11     6 
##   awi   awj   awk   awn   awq   awr   awt   awu   awy   axa   axg   axj   axo   axq   axr   axt 
##     3     4     1     2    16     5     1     1     2     2     2     2     4     2     1     4 
##   axv   axx   axz   ayf   ayg   ayi   ayl   ayq   ayr   ayw   ayx   aze   azg   azh   azi   azp 
##     2     6     1     1     5     2     1     4     1     4     2     2     2     1     2     2 
##   azq   azv   azw   azx   azz   bam   baq   bay   bba   bby   bde   bds   bef   ben   beo   bep 
##     1     2     6     1     1     1     1     1     1     1     1     1     1   156   159    42 
##   beq   bes   bet   beu   bev   bew   bex   bey   bfa   bfb   bfc   bfd   bfe   bff   bfg   bfh 
##    13    44    27   103    47    39    14    97    63   373   614    12   307    43    17    25 
##   bfi   bfj   bfk   bfl   bfn   bfp   bfq   bfr   bfs   bft   bfu   bfv   bfw   bfx   bfy   bfz 
##   147   101    23    13    22   112    13   120     4    11    86    66   171    16    29    35 
##   bgc   bge   bgf   bgg   bgj   bgk   bgl   bgm   bgn   bgo   bgq   bgr   bgs   bgt   bgx   bgz 
##   305   113    33   117    11    16    44    39   227    46    88    16    47    64    66    23 
##   bha   bhb   bhc   bhe   bhf   bhg   bhj   bhm   bhn   bho   bhp   bhq   bhs   bhx   bhz   bia 
##   250    38    36     9    30   120    11    75    19   143    89    34    35    14    36     2 
##   bib   bid   bie   big   bih   bij   bik   bil   bio   bip   bis   bit   biu   biv   biw   bix 
##     8    45    89    10   143     6    14   990     1     7    62    35    16    67    10    90 
##   biy   biz   bja   bjb   bjc   bjd   bje   bjf   bjg   bjh   bji   bjk   bjl   bjm   bjn   bjp 
##    75    59    18    83   847    16    17     2    57    10   171     2   494    33     9    14 
##   bjv   bjy   bkd   bke   bkh   bki   bkl   bkn   bko   bkp   bkt   bku   bkv   bky   bkz   blb 
##   109    41    56    11    49    45    12    92    13     3    23    19   324     5     5    39 
##   blc   bld   blf   bli   bll   blm   bln   blp   blq   blr   bls   blu   blv   blw   blx   bma 
##     3   115     6     7    13    43   293    19     5   153    15    59     5    19    26    12 
##   bmc   bmh   bmk   bml   bmp   bmq   bmv   bnb   bnf   bni   bnl   bnn   bnp   bnt   bnx   bnz 
##     2    29   315     4     7    30     3     2    47     2    38    19    17     4     3    16 
##   boa   boc   boe   boh   boj   bom   bon   boo   bor   bos   bou   boy   bpa   bpb   bpf   bpi 
##     2    42     9    12     3    11     3    33     3     7     8    19     6    12     7     6 
##   bpj   bpk   bpq   bpv   bpw   bpy   bqc   bqj   bqk   bqm   bqp   bqq   bqr   bqs   bqt   bqw 
##     5     1     9    48    10     7     6     5     1     2    11    26    13    20    36     1 
##   bqy   bqz   brb   brd   brh   brm   brn   bro   brq   brr   brs   brw   bsb   bsc   bsd   bsf 
##     2     5     3     4     1    52     1     8     3     4     1     1     5     1     3     7 
##   bsj   bsl   bsq   bsr   bst   bsw   bsy   bsz   bta   btb   bte   btg   bti   btk   btr   btu 
##     1     1     3    10     2     7     1     2    12     8     1     1     4     3     9     1 
##   btw   btx   bua   buf   bug   buh   bui   buk   bul   buo   buq   bux   buz   bvc   bvd   bvf 
##     3     2     7     2     5     2     1     2     9     1     2     5     2     7     4     3 
##   bvg   bvh   bvi   bvs   bvu   bvv   bvw   bvz   bwd   bwe   bwf   bwh   bwk   bwn   bwq   bwr 
##    19     1     1     1     1     8     1     7     6     3     2     2     2     8     2     2 
##   bwu   bww   bxc   bxe   bxp   bxs   bxw   bxz   bye   byf   byh   byi   byk   byl   bym   byr 
##     1     2     5     1     4     1     3     3     1     1     1     5     2     1     3     1 
##   byt   byy   bza   bzb   bze   bzh   bzi   bzl   bzm   bzt   cab   cag   cal   cao   car   cau 
##     2     3     3     3     2     1     5     4     1     1     1     1     1     1     1     1 
##   caw   cbb   cbc   cbr   cbx   cbz   ccd   ccw   ccz   cds   ceq   cfe   cfp   cgf   cgz   chs 
##     2     1     3     1     1     1     1     1     2     1     1     1     3     1     1     1

5) Like taxon_names, n_obs can be used within calls to filter_taxa as if it were a variable. Try to use filter_taxa to remove all taxa that have less than 100 OTUs.

filter_taxa(obj, n_obs >= 100)

## <Taxmap>
##   122 taxa: aad. Bacteria, aaf. Proteobacteria ... bmk. Legionellaceae
##   122 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... aoj->bln, aqj->blr, ano->bmk
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

6a) Subset obj to just “Actinobacteria” and all of its subtaxa.

filter_taxa(obj, taxon_names == "Actinobacteria", subtaxa = TRUE)

## <Taxmap>
##   72 taxa: aag. Actinobacteria, ade. Actinobacteria ... cfp. Dermacoccaceae
##   72 edges: NA->aag, aag->ade, aag->ado, aag->ady ... amj->cbz, anq->cfe, amj->cfp
##   1 data sets:
##     otu_counts:
##       # A tibble: 1,781 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 beq      6             70     4055     206    2415       2    2901      56    2249
##       2 bev      11            45     1994      17    1741       0    1598      16    2717
##       3 bew      12           711     3553     501    2220       1    2410     477    2490
##       # … with 1,778 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

6b) Why are there fewer OTUs in the result of the above filtering?

Because the OTUs assinged to taxa besides Actinobacteria or one of its subtaxa were removed.

6c) Look at the documentation for filter_taxa by typing ?filter_taxa. See if you can modify the previous command so that no OTUs are removed from the table, but the taxonomy is still subset to “Actinobacteria” and all of its subtaxa.

filter_taxa(obj, taxon_names == "Actinobacteria", subtaxa = TRUE, drop_obs = FALSE)

## <Taxmap>
##   72 taxa: aag. Actinobacteria, ade. Actinobacteria ... cfp. Dermacoccaceae
##   72 edges: NA->aag, aag->ade, aag->ado, aag->ady ... amj->cbz, anq->cfe, amj->cfp
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 NA       2              7        0       2       7       1       7      24       4
##       2 NA       3          10251       59   15857     309     134      90   11391      63
##       3 NA       4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

6d) What happend to the taxon IDs for OTUs that are not in “Actinobacteria” and why?

The OTUs not in Actinobacteria were assinged a taxon ID of NA because there are no taxa left they could be reassigned to.

7a) The function taxon_ranks returns the ranks for each taxon and it can be used like taxon_names in calls to filter_taxa. Look at the result of taxon_ranks(obj). Subset obj to phyla (“p” in our data) and their supertaxa.

filter_taxa(obj, taxon_ranks == "p", supertaxa = TRUE)

## <Taxmap>
##   50 taxa: aad. Bacteria, aaf. Proteobacteria ... acu. GAL15, acx. Kazan-3B-28
##   50 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... aad->act, aad->acu, aad->acx
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 aaf      2              7        0       2       7       1       7      24       4
##       2 aaf      3          10251       59   15857     309     134      90   11391      63
##       3 aaf      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

7b) Were any OTUs removed? Why?

No, because they were reassigned to the phyla.

8a) Try to remove “Actinobacteria” and all of its subtaxa. You will probably need to look at the documentation for filter_taxa by typing ?filter_taxa to figure out how to do this.

filter_taxa(obj, taxon_names == "Actinobacteria", subtaxa = TRUE, invert = TRUE)

## <Taxmap>
##   632 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   632 edges: NA->aad, aad->aaf, aad->aah, aad->aai ... amg->cgf, arh->cgz, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

8b) Why were no OTUs removed when “Actinobacteria” and all of its subtaxa were removed?

Because the OTUs that were assigned to “Actinobacteria” were reassigned to “Bacteria”, its supertaxon.

8c) Try to modify the previous command to remove the OTUs that are assigned to “Actinobacteria”.

filter_taxa(obj, taxon_names == "Actinobacteria",
            subtaxa = TRUE,
            invert = TRUE,
            reassign_obs = FALSE)

## <Taxmap>
##   632 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   632 edges: NA->aad, aad->aaf, aad->aah, aad->aai ... amg->cgf, arh->cgz, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 19,903 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 19,900 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

Subsetting data associated with a taxonomy

For the questions below, do not overwrite the obj with result of filtering (i.e. don’t assign the result to obj with <-).

9a) Look at the documentation for filter_obs by typing ?filter_obs. Use filter_obs to remove all rows of “otu_counts” that have less than 5 reads in the first sample, “M1981P563”, without removing any taxa in the taxonomy.

filter_obs(obj, "otu_counts", M1981P563 > 5)

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   1 data sets:
##     otu_counts:
##       # A tibble: 235 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 232 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

# Also works: filter_obs(obj, "otu_counts", obj$data$otu_counts$M1981P563 > 5)

9b) Modify the previous command to remove taxa as well as rows in the table.

filter_obs(obj, "otu_counts", M1981P563 > 5, drop_taxa = TRUE)

## <Taxmap>
##   101 taxa: aad. Bacteria, aaf. Proteobacteria ... btg. Bogoriellaceae
##   101 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... aor->bkv, aoj->bln, amj->btg
##   1 data sets:
##     otu_counts:
##       # A tibble: 235 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 232 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##   0 functions:

10a) The function calc_n_samples from the metacoder package returns the number of samples each OTU appears in (i.e. has a non-zero count):

library(metacoder)
calc_n_samples(obj, data = "otu_counts")

## NOTE: Using the "groups" option without the "cols" option can yeild incorrect results if the column order is different from the group order.

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502, M1981P606 ... M1955P747, M1978P342, M1955P778

## Calculating number of samples with a value greater than 0 for 292 columns for 21684 observations

## # A tibble: 21,684 × 2
##    taxon_id n_samples
##    <chr>        <int>
##  1 ben            245
##  2 beo            292
##  3 ben            292
##  4 bep            282
##  5 beq            279
##  6 amk            292
##  7 bes            279
##  8 bet            287
##  9 beu            288
## 10 bev            281
## # … with 21,674 more rows

Using the groups option, it can do the calculation on columns grouped by treatment:

calc_n_samples(obj, data = "otu_counts", groups = sample_data$Type)

## NOTE: Using the "groups" option without the "cols" option can yeild incorrect results if the column order is different from the group order.

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502, M1981P606 ... M1955P747, M1978P342, M1955P778

## Calculating number of samples with a value greater than 0 for 292 columns in 2 groups for 21684 observations

## # A tibble: 21,684 × 3
##    taxon_id  leaf  root
##    <chr>    <int> <int>
##  1 ben        126   119
##  2 beo        146   146
##  3 ben        146   146
##  4 bep        136   146
##  5 beq        133   146
##  6 amk        146   146
##  7 bes        133   146
##  8 bet        143   144
##  9 beu        143   145
## 10 bev        135   146
## # … with 21,674 more rows

Run the following code to add this as a table to a copy of your taxmap object:

obj2 <- obj$clone(deep = TRUE)
obj2$data$sample_counts <- calc_n_samples(obj,
                                         data = "otu_counts",
                                         groups = sample_data$Type,
                                         other_cols = "OTU_ID") # Preserves OTU_ID column

## NOTE: Using the "groups" option without the "cols" option can yeild incorrect results if the column order is different from the group order.

## No `cols` specified, so using all numeric columns:
##    M1981P563, M1977P1709, M1980P502, M1981P606 ... M1955P747, M1978P342, M1955P778

## Calculating number of samples with a value greater than 0 for 292 columns in 2 groups for 21684 observations

print(obj2)

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   2 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 21,684 × 4
##         taxon_id OTU_ID  leaf  root
##         <chr>    <chr>  <int> <int>
##       1 ben      2        126   119
##       2 beo      3        146   146
##       3 ben      4        146   146
##       # … with 21,681 more rows
##   0 functions:

Since the “sample_counts” table is derived from the “otu_counts” table and in the same order, we can use values from it to subset “otu_counts”. Using filter_obs, try to subset “otu_counts” to taxa that do not appear in leaf samples, but appear in at least 100 root samples. Remove any taxa not represented by these OTUs unique to roots.

filter_obs(obj2, "otu_counts", leaf == 0, root > 100, drop_taxa = TRUE)

## <Taxmap>
##   73 taxa: aad. Bacteria, aaf. Proteobacteria ... bln. Isosphaeraceae, blx. Hyphomonadaceae
##   73 edges: NA->aad, aad->aaf, aad->aag, aad->aai ... aoo->blb, aoj->bln, apj->blx
##   2 data sets:
##     otu_counts:
##       # A tibble: 49 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ami      757            0        3       0       1       0       7       0       4
##       2 amm      983            0       15       0       6       0      26       0      11
##       3 aqn      988            0       13       0       7       0      17       0       3
##       # … with 46 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 7,339 × 4
##         taxon_id OTU_ID  leaf  root
##         <chr>    <chr>  <int> <int>
##       1 beo      3        146   146
##       2 amj      15        49   146
##       3 amo      17        59   146
##       # … with 7,336 more rows
##   0 functions:

10b) The function arrange_obs sorts tables in taxmap object. For example, you can sort the OTU table in obj by the number of reads in sample “M1981P563” like so:

arrange_obs(obj2, "otu_counts", M1981P563)

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   2 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 amo      17             0     1459       1    2164       0    2990       2     963
##       2 bfb      21             0        1       0       0       0       0       0       0
##       3 bfc      22             0      244       0     259       0      14       0     200
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 21,684 × 4
##         taxon_id OTU_ID  leaf  root
##         <chr>    <chr>  <int> <int>
##       1 ben      2        126   119
##       2 beo      3        146   146
##       3 ben      4        146   146
##       # … with 21,681 more rows
##   0 functions:

To sort in descending order wrap the sorting parameter in desc():

arrange_obs(obj2, "otu_counts", desc(M1981P563))

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   2 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 beo      3          10251       59   15857     309     134      90   11391      63
##       2 bfb      25          3594        4    3372      31       2       3    2308       0
##       3 bex      13          3083       31    4407     102      26      39    4897      22
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 21,684 × 4
##         taxon_id OTU_ID  leaf  root
##         <chr>    <chr>  <int> <int>
##       1 ben      2        126   119
##       2 beo      3        146   146
##       3 ben      4        146   146
##       # … with 21,681 more rows
##   0 functions:

Try to sort the OTU table by the difference between the number of leaf samples and OTU was found in and the number of root samples and OTU was found in. This effectively sorts by how “characteristic” an OTU is to root samples.

arrange_obs(obj2, "sample_counts", leaf - root)

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   2 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 21,684 × 4
##         taxon_id OTU_ID  leaf  root
##         <chr>    <chr>  <int> <int>
##       1 bfw      436        5   146
##       2 bji      881        1   142
##       3 bil      927        4   145
##       # … with 21,681 more rows
##   0 functions:

10c) What is the ID of the OTU most “characteristic” of roots vs leafs?

436, 881, or 927

10d) mutate_obs is used to add new columns to tables in a taxmap objects. Look at the documentation and see if you can figure out how to add the results of the classifications function as a column in the “sample_counts” table. Note that you will need to use the “taxon_id” column of the “sample_counts” table to subset the results of the classifications function so that they match the taxon IDs in the “sample_counts” table.

mutate_obs(obj2, "sample_counts", class = classifications(obj2)[obj2$data$sample_counts$taxon_id])

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   2 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 21,684 × 5
##         taxon_id OTU_ID  leaf  root class                                                 
##         <chr>    <chr>  <int> <int> <chr>                                                 
##       1 ben      2        126   119 Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsi…
##       2 beo      3        146   146 Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomo…
##       3 ben      4        146   146 Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsi…
##       # … with 21,681 more rows
##   0 functions:

10e) Now use the %>% operator to pipe the code for part d into the code for part b to sort it and see what taxa are most “characteristic” of roots vs leafs.

obj2 %>%
  mutate_obs("sample_counts", class = classifications(obj2)[obj2$data$sample_counts$taxon_id]) %>%
  arrange_obs("sample_counts", leaf - root)

## <Taxmap>
##   704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
##   704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
##   2 data sets:
##     otu_counts:
##       # A tibble: 21,684 × 294
##         taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
##         <chr>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##       1 ben      2              7        0       2       7       1       7      24       4
##       2 beo      3          10251       59   15857     309     134      90   11391      63
##       3 ben      4           1447     2496    2421    3331     143    1847    2120    2048
##       # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
##       #   M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
##       #   M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
##       #   abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
##       #   ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
##     sample_counts:
##       # A tibble: 21,684 × 5
##         taxon_id OTU_ID  leaf  root class                                                 
##         <chr>    <chr>  <int> <int> <chr>                                                 
##       1 bfw      436        5   146 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobial…
##       2 bji      881        1   142 Bacteria;Firmicutes;Bacilli;Bacillales;Paenibacillace…
##       3 bil      927        4   145 Bacteria;Planctomycetes;Planctomycetia;Gemmatales;Gem…
##       # … with 21,681 more rows
##   0 functions:

10f) What are the families of the OTUs most “characteristic” of roots vs leafs?

Hyphomicrobiaceae, Paenibacillaceae, and Gemmataceae

References

Wagner, Maggie R, Derek S Lundberg, G Tijana, Susannah G Tringe, Jeffery L Dangl, and Thomas Mitchell-Olds. 2016. “Host Genotype and Age Shape the Leaf and Root Microbiomes of a Wild Perennial Plant.” Nature Communications 7: 12151. https://doi.org/10.1038/ncomms12151.

Wickham, Hadley, and Romain Francois. 2015. “Dplyr: A Grammar of Data Manipulation.” R Package Version 0.4 3.