Tabular data is relatively straight forward to manipulate if you know some R already, but taxonomic data is hierarchical, making it much harder to work with. It gets even more complicated when there are other data associated with the taxonomy.
For example, if we want to remove a bacterial phylum from the data, do we also remove all of the taxa within that phylum? If so, how do we identify which taxa are in that phylum? Depending on the format the data is in, this might be hard. If other data (e.g. OTU counts) are associated with taxa within that phylum, do we also remove those too? After all, even if the phylum and the taxa within it are removed, we still know that all the associated data are bacterial, so do we reassign them to the Bacteria taxon, or just throw those data out? The answer to these kind of questions will depend on our data and what we want to do with it.
If you are starting the workshop at this section, or had problems running code in a previous section, use the following to load the data used in this section. You can download the “parsed_data.Rdata” file here. If you have just done the previous section, you can ignore this and proceed.
load("parsed_data.Rdata")
The
functions
to subset taxonomic information in taxmap
objects (the
format our example data is in) provided by the metacoder
package are modeled after functions in the dplyr
package
for manipulating tabular data (Wickham and
Francois (2015)). These dplyr
functions are
alternatives to the standard R subsetting (e.g. [
,
[[
, and $
) designed to be be more intuitive
and consistent. Since an understanding of these functions will help with
the analogous (and more complicated) functions for taxonomic data, we
will start by practicing these. Fortunately, we have a tabular data set
in need of some subsetting: our sample data.
First, lets take another look at our sample data:
print(sample_data)
## # A tibble: 1,698 × 16
## SampleID Name Plant…¹ Type Exper…² Cohort Harve…³ Age Site Treat…⁴ Line Genot…⁵ Block
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 M1024P1788 R_026.ro… R_026 root fieldB… 2008 2011 3 LTM field 26 ril NA
## 2 M1024P1791 R_073.ro… R_073 root fieldB… 2008 2011 3 LTM field 73 ril NA
## 3 M1024P1795 R_088.ro… R_088 root fieldB… 2009 2011 2 LTM field 88 ril NA
## 4 M1024P1807 R_156.ro… R_156 root fieldB… 2009 2011 2 LTM field 156 ril NA
## 5 M1955P804 1_A_1.ro… 1_A_1 root ecoGH NA 2011 NA Duke MAHsoil par1… PAR b01
## 6 M1956P837 1_A_12.r… 1_A_12 root ecoGH NA 2011 NA Duke MAHsoil mah3… MAH b01
## 7 M1957P983 1_A_3.ro… 1_A_3 root ecoGH NA 2011 NA Duke MAHsoil silL… SIL b01
## 8 M1956P845 1_A_4.ro… 1_A_4 root ecoGH NA 2011 NA Duke MAHsoil par1… PAR b01
## 9 M1957P987 1_A_7.ro… 1_A_7 root ecoGH NA 2011 NA Duke MAHsoil par9… PAR b01
## 10 M1957P923 1_A_8.ro… 1_A_8 root ecoGH NA 2011 NA Duke MAHsoil mil5… MIL b01
## # … with 1,688 more rows, 3 more variables: oldPlate <chr>, newPlate <chr>, Analysis <chr>, and
## # abbreviated variable names ¹Plant_ID, ²Experiment, ³Harvested, ⁴Treatment, ⁵Genotype
This is a large data set and not all the 1698 samples are used in the main publication of Wagner et al. (2016). Only the samples in the “ecotypes” experiment are part of the main study. Using standard R subsetting we could get just those rows like so:
<- sample_data[sample_data$Experiment == "ecotypes", ] sample_data
Using the dplyr
functions, the same operation would
be:
library(dplyr)
<- filter(sample_data, Experiment == "ecotypes") sample_data
Note how we could use the Experiment
column on its own
within the filter
functions. That is a special property of
the filter
function and many others in dplyr
known as
Non-standard
evaluation (NSE). If we tried that outside of the function, we
would get an error. For example, the following will not work:
<- Experiment == "ecotypes" is_ecotype
## Error in eval(expr, envir, enclos): object 'Experiment' not found
Most dplyr
functions work this way and so do the
analogous functions in metacoder
for manipulating taxonomic
data. If we have multiple filtering criteria, we can simply add more
arguments to the functions. In the following code, we subset the table
to only rows for plants from one of three site and that are 3 years
old.
<- filter(sample_data,
sample_data %in% c("Mah", "Jam", "Sil"),
Site == 3) Age
This is usually equivalent to combining multiple conditions with
&
in normal subsetting, but is faster for large data
sets. It also will not return values for which conditions evaluate to
NA
, unlike
base R
subsetting.
Up until now, everything we have done using filter
can
be easily done with base R subsetting as well, but dplyr
introduces a few more advanced features that can save time/typing. We
wont go into these in detail, but its good to know they exist. One is
the concept of “grouping”. When rows are “grouped” by the values in one
column, the rows corresponding to each unique value in that column are
treated as a unit in some functions. n()
is one such
function that counts the number of items in each group. We can use this
to filter out any plants that don’t have both root and leaf samples by
counting the number of times each “Plant_ID” shows up:
<- sample_data %>%
sample_data group_by(Plant_ID) %>%
filter(n() == 2)
This leaves us with 292 samples. Note the use of the
%>%
“piping” operator. This takes the result of the
function that comes before and uses it as the first argument to function
that comes after. For example, seq(1:10) %>% length()
is
the same as length(seq(1:10))
. This allows for chaining
many commands together in a readable way. Next we will subset the
abundance data to just these samples to get the dataset down to a
manageable size for a standard personal computer.
The analog to filter
for taxonomic data is
filter_taxa
from the metacoder
package. This
works in a similar way, but has many more options for how taxonomic
relationships are used and what is done with data associated with
removed taxa. Lets take another look at our data object:
print(obj)
## <Taxmap>
## 1558 taxa: aab. Unassigned, aac. Root ... chx. Methanospirillaceae, chy.
## 1558 edges: NA->aab, NA->aac, aac->aad, aac->aae ... bel->chw, ays->chx, bem->chy
## 1 data sets:
## otu_counts:
## # A tibble: 47,806 × 1,707
## taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aab 1 0 0 0 0 0 0 0 0
## 2 ben 2 41 22 4 726 112492 2 413 2
## 3 beo 3 67 65 12 13514 1 4 13314 70
## # … with 47,803 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
## # M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
## # M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
## # names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
## # ⁷M1551P90
## 0 functions:
If we look at the second line, we can see that some taxa do not have
names (the taxon with the ID “chy”). These are the ones that only had
rank
information in the classification we parsed in the last section
(e.g. o__
in
"Root;k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__"
). Since
these do not add any information, lets remove them.
library(metacoder)
<- filter_taxa(obj, taxon_names != "")
obj print(obj)
## <Taxmap>
## 959 taxa: aab. Unassigned, aac. Root ... chs. Alcanivoracaceae, chx. Methanospirillaceae
## 959 edges: NA->aab, NA->aac, aac->aad, aac->aae ... anq->cho, apt->chs, ays->chx
## 1 data sets:
## otu_counts:
## # A tibble: 47,806 × 1,707
## taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aab 1 0 0 0 0 0 0 0 0
## 2 ben 2 41 22 4 726 112492 2 413 2
## 3 beo 3 67 65 12 13514 1 4 13314 70
## # … with 47,803 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
## # M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
## # M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
## # names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
## # ⁷M1551P90
## 0 functions:
There were OTUs associated with the nameless taxa, but no OTUs were
filtered out, so what happened to them? By default, they are reassigned
to the closest supertaxon that passes the filter. For example, for the
classification
"Root;k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__"
, the OTU
would be reassigned to the class “SJA-28” and the taxon ID for that OTU
would be changed to the taxon ID for “SJA-28”.
Like filter
, filter_taxa
can use variables
contained in the input objects as if they were independent variables. In
this case, taxon_names
is actually a function that is
called when referenced in this way.
head(taxon_names(obj))
## aab aac aad aae aaf
## "Unassigned" "Root" "Bacteria" "Archaea" "Proteobacteria"
## aag
## "Actinobacteria"
Other values that can be used this way include columns in tables and
lists/vectors
stored in the obj$data
list in taxmap
objects.
Since taxmap
objects can store any number of lists,
vectors, or tables mapped to a taxonomy, there can be many variables
that can be referenced by name. The all_names
function
returns the name and location of variables that can be used this
way:
head(all_names(obj), 20)
## taxon_names taxon_ids
## "taxon_names" "taxon_ids"
## taxon_indexes classifications
## "taxon_indexes" "classifications"
## n_supertaxa n_supertaxa_1
## "n_supertaxa" "n_supertaxa_1"
## n_subtaxa n_subtaxa_1
## "n_subtaxa" "n_subtaxa_1"
## n_leaves n_leaves_1
## "n_leaves" "n_leaves_1"
## taxon_ranks is_root
## "taxon_ranks" "is_root"
## is_stem is_branch
## "is_stem" "is_branch"
## is_leaf is_internode
## "is_leaf" "is_internode"
## n_obs n_obs_1
## "n_obs" "n_obs_1"
## data[['otu_counts']][['OTU_ID']] data[['otu_counts']][['M1024P1833']]
## "OTU_ID" "M1024P1833"
length(all_names(obj))
## [1] 1724
Next, lets also filter out anything not in Bacteria, since the focus of the study was Bacteria.
<- filter_taxa(obj, taxon_names == "Bacteria", subtaxa = TRUE)
obj print(obj)
## <Taxmap>
## 898 taxa: aad. Bacteria, aaf. Proteobacteria ... cho. wb1_P06, chs. Alcanivoracaceae
## 898 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... bea->chi, anq->cho, apt->chs
## 1 data sets:
## otu_counts:
## # A tibble: 33,641 × 1,707
## taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ben 2 41 22 4 726 112492 2 413 2
## 2 beo 3 67 65 12 13514 1 4 13314 70
## 3 ben 4 3229 13679 1832 951 113 2496 567 2428
## # … with 33,638 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
## # M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
## # M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
## # names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
## # ⁷M1551P90
## 0 functions:
Note the use of the subtaxa = TRUE
; this means that all
the subtaxa of taxa that pass the filter should also be preserved. If we
left this option off, we would would have only a single taxon with all
OTUs assigned to it:
filter_taxa(obj, taxon_names == "Bacteria")
## <Taxmap>
## 1 taxa: aad. Bacteria
## 1 edges: NA->aad
## 1 data sets:
## otu_counts:
## # A tibble: 33,641 × 1,707
## taxon_id OTU_ID M1024P1833 M1551…¹ M1551…² M1551…³ M1551…⁴ M1551…⁵ M1551…⁶ M1551…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aad 2 41 22 4 726 112492 2 413 2
## 2 aad 3 67 65 12 13514 1 4 13314 70
## 3 aad 4 3229 13679 1832 951 113 2496 567 2428
## # … with 33,638 more rows, 1,697 more variables: M1551P71 <dbl>, M1551P12 <dbl>,
## # M1551P84 <dbl>, M1551P48 <dbl>, M1551P4 <dbl>, M1551P52 <dbl>, M1551P3 <dbl>,
## # M1551P15 <dbl>, M1551P31 <dbl>, M1551P75 <dbl>, …, and abbreviated variable
## # names ¹M1551P81, ²M1551P57, ³M1551P85, ⁴M1551P28, ⁵M1551P29, ⁶M1551P38,
## # ⁷M1551P90
## 0 functions:
Note how the taxon IDs for OTUs change when taxa are filtered. Next, lets remove the columns from the abundance matrix that were removed from the sample data at the start of the section.
$data$otu_counts <- obj$data$otu_counts[c("taxon_id", "OTU_ID", sample_data$SampleID)] obj
This leaves us with a data set of 33,641 OTUs in 292 samples, classified by 898 taxa:
print(obj)
## <Taxmap>
## 898 taxa: aad. Bacteria, aaf. Proteobacteria ... cho. wb1_P06, chs. Alcanivoracaceae
## 898 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... bea->chi, anq->cho, apt->chs
## 1 data sets:
## otu_counts:
## # A tibble: 33,641 × 294
## taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ben 2 7 0 2 7 1 7 24 4
## 2 beo 3 10251 59 15857 309 134 90 11391 63
## 3 ben 4 1447 2496 2421 3331 143 1847 2120 2048
## # … with 33,638 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
## # M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
## # M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
## # abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
## # ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
## 0 functions:
In the previous section we subset the taxonomy in a
taxmap
object using filter_taxa
. By default,
when taxa are removed, the OTUs assigned to them were reassigned to a
supertaxon or removed as well. We can also subset the OTU table directly
and remove any taxa that are no longer needed using the function
filter_obs
. The “obs” in the function name stands for
“observations”. Unlike filter
, which you used previously to
subset tables, filter_obs
works on a data set in a
taxmap
object, like our OTU table.
In the last section we removed samples (i.e. columns in the OTU
table) that did not appear in the Wagner et al.
(2016) publication. This probably means that some of the OTUs
only existed in those samples and now have no reads in the remaining
samples. Lets remove those to make the data set more manageable. First
we need to find which OTUs (i.e. rows) no have no reads. We can do that
by using the function rowSums
to get the number of reads in
all samples for each OTU.
<- rowSums(obj$data$otu_counts[, sample_data$SampleID]) == 0 has_no_reads
Since the table otu_counts
also has the taxon ID and OTU
ID columns we need to subset the columns to just the samples before
using rowSums
. The command above results in a
logical
vector, which is a series of
TRUE
/FALSE
values that can be used to subset
things. We can use sum
to count the number of
TRUE
values and hence the number of OTUs without reads.
sum(has_no_reads)
## [1] 11957
There are 11,957 of 33,641 OTUs that now have no reads. Now that we
know which they are, we can remove them using
filter_obs
.
filter_obs(obj, "otu_counts", ! has_no_reads) # note the ! negation operator
## <Taxmap>
## 898 taxa: aad. Bacteria, aaf. Proteobacteria ... cho. wb1_P06, chs. Alcanivoracaceae
## 898 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... bea->chi, anq->cho, apt->chs
## 1 data sets:
## otu_counts:
## # A tibble: 21,684 × 294
## taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ben 2 7 0 2 7 1 7 24 4
## 2 beo 3 10251 59 15857 309 134 90 11391 63
## 3 ben 4 1447 2496 2421 3331 143 1847 2120 2048
## # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
## # M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
## # M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
## # abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
## # ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
## 0 functions:
Note how this removed rows in the “otu_counts” table but did not
remove any taxa. To remove taxa that are no longer observed in any OTUs,
we need to add the drop_taxa
option:
<- filter_obs(obj, "otu_counts", ! has_no_reads, drop_taxa = TRUE)
obj print(obj)
## <Taxmap>
## 704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
## 704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
## 1 data sets:
## otu_counts:
## # A tibble: 21,684 × 294
## taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ben 2 7 0 2 7 1 7 24 4
## 2 beo 3 10251 59 15857 309 134 90 11391 63
## 3 ben 4 1447 2496 2421 3331 143 1847 2120 2048
## # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
## # M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
## # M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
## # abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
## # ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
## 0 functions:
If there were other tables in obj$data
the data for
those taxa would have been removed as well unless
drop_obs = FALSE
was added. Since there is only the one
table, we don’t need to worry about that.
In these exercises, we will be using obj
and
sample_data
from the analysis above. If you did not run the
code above or had problems, run the following code to get the objects
used. You can download the “filtered_data.Rdata” file
here.
load("filtered_data.Rdata")
1) Try to subset sample_data
to just
“leaf” samples using standard R subsetting
(e.g. my_table[rows, columns]
).
2) Try to do the same thing using the
filter
function from the dplyr
package.
3) Using a single call to the filter
function, get all leaf samples from site “Jam” that are of genotype
“SIL” or “MIL”.
For the questions below, do not overwrite the obj
with
result of filtering (i.e. don’t assign the result to obj
with <-
).
4) The function n_obs
counts the number
of items in the first dataset (the OTU table in our case) assigned to
each taxon. Try running n_obs(obj)
to see how many OTUs are
within each taxon.
5) Like taxon_names
, n_obs
can be used within calls to filter_taxa
as if it were a
variable. Try to use filter_taxa
to remove all taxa that
have less than 100 OTUs.
6a) Subset obj
to just “Actinobacteria”
and all of its subtaxa.
6b) Why are there fewer OTUs in the result of the above filtering?
6c) Look at the documentation for
filter_taxa
by typing ?filter_taxa
. See if you
can modify the previous command so that no OTUs are removed from the
table, but the taxonomy is still subset to “Actinobacteria” and all of
its subtaxa.
6d) What happend to the taxon IDs for OTUs that are not in “Actinobacteria” and why?
7a) The function taxon_ranks
returns
the ranks for each taxon and it can be used like
taxon_names
in calls to filter_taxa
. Look at
the result of taxon_ranks(obj)
. Subset obj
to
phyla (“p” in our data) and their supertaxa.
7b) Were any OTUs removed? Why?
8a) Try to remove “Actinobacteria” and all of its
subtaxa. You will probably need to look at the documentation for
filter_taxa
by typing ?filter_taxa
to figure
out how to do this.
8b) Why were no OTUs removed when “Actinobacteria” and all of its subtaxa were removed?
8c) Try to modify the previous command to remove the OTUs that are assigned to “Actinobacteria”.
For the questions below, do not overwrite the obj
with
result of filtering (i.e. don’t assign the result to obj
with <-
).
9a) Look at the documentation for
filter_obs
by typing ?filter_obs
. Use
filter_obs
to remove all rows of “otu_counts” that have
less than 5 reads in the first sample, “M1981P563”, without removing any
taxa in the taxonomy.
9b) Modify the previous command to remove taxa as well as rows in the table.
10a) The function calc_n_samples
from
the metacoder
package returns the number of samples each
OTU appears in (i.e. has a non-zero count):
library(metacoder)
calc_n_samples(obj, data = "otu_counts")
## NOTE: Using the "groups" option without the "cols" option can yeild incorrect results if the column order is different from the group order.
## No `cols` specified, so using all numeric columns:
## M1981P563, M1977P1709, M1980P502, M1981P606 ... M1955P747, M1978P342, M1955P778
## Calculating number of samples with a value greater than 0 for 292 columns for 21684 observations
## # A tibble: 21,684 × 2
## taxon_id n_samples
## <chr> <int>
## 1 ben 245
## 2 beo 292
## 3 ben 292
## 4 bep 282
## 5 beq 279
## 6 amk 292
## 7 bes 279
## 8 bet 287
## 9 beu 288
## 10 bev 281
## # … with 21,674 more rows
Using the groups
option, it can do the calculation on
columns grouped by treatment:
calc_n_samples(obj, data = "otu_counts", groups = sample_data$Type)
## NOTE: Using the "groups" option without the "cols" option can yeild incorrect results if the column order is different from the group order.
## No `cols` specified, so using all numeric columns:
## M1981P563, M1977P1709, M1980P502, M1981P606 ... M1955P747, M1978P342, M1955P778
## Calculating number of samples with a value greater than 0 for 292 columns in 2 groups for 21684 observations
## # A tibble: 21,684 × 3
## taxon_id leaf root
## <chr> <int> <int>
## 1 ben 126 119
## 2 beo 146 146
## 3 ben 146 146
## 4 bep 136 146
## 5 beq 133 146
## 6 amk 146 146
## 7 bes 133 146
## 8 bet 143 144
## 9 beu 143 145
## 10 bev 135 146
## # … with 21,674 more rows
Run the following code to add this as a table to a copy of your
taxmap
object:
<- obj$clone(deep = TRUE)
obj2 $data$sample_counts <- calc_n_samples(obj,
obj2data = "otu_counts",
groups = sample_data$Type,
other_cols = "OTU_ID") # Preserves OTU_ID column
## NOTE: Using the "groups" option without the "cols" option can yeild incorrect results if the column order is different from the group order.
## No `cols` specified, so using all numeric columns:
## M1981P563, M1977P1709, M1980P502, M1981P606 ... M1955P747, M1978P342, M1955P778
## Calculating number of samples with a value greater than 0 for 292 columns in 2 groups for 21684 observations
print(obj2)
## <Taxmap>
## 704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
## 704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
## 2 data sets:
## otu_counts:
## # A tibble: 21,684 × 294
## taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ben 2 7 0 2 7 1 7 24 4
## 2 beo 3 10251 59 15857 309 134 90 11391 63
## 3 ben 4 1447 2496 2421 3331 143 1847 2120 2048
## # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
## # M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
## # M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
## # abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
## # ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
## sample_counts:
## # A tibble: 21,684 × 4
## taxon_id OTU_ID leaf root
## <chr> <chr> <int> <int>
## 1 ben 2 126 119
## 2 beo 3 146 146
## 3 ben 4 146 146
## # … with 21,681 more rows
## 0 functions:
Since the “sample_counts” table is derived from the “otu_counts”
table and in the same order, we can use values from it to subset
“otu_counts”. Using filter_obs
, try to subset “otu_counts”
to taxa that do not appear in leaf samples, but appear in at least 100
root samples. Remove any taxa not represented by these OTUs unique to
roots.
10b) The function arrange_obs
sorts
tables in taxmap
object. For example, you can sort the OTU
table in obj
by the number of reads in sample “M1981P563”
like so:
arrange_obs(obj2, "otu_counts", M1981P563)
## <Taxmap>
## 704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
## 704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
## 2 data sets:
## otu_counts:
## # A tibble: 21,684 × 294
## taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 amo 17 0 1459 1 2164 0 2990 2 963
## 2 bfb 21 0 1 0 0 0 0 0 0
## 3 bfc 22 0 244 0 259 0 14 0 200
## # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
## # M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
## # M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
## # abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
## # ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
## sample_counts:
## # A tibble: 21,684 × 4
## taxon_id OTU_ID leaf root
## <chr> <chr> <int> <int>
## 1 ben 2 126 119
## 2 beo 3 146 146
## 3 ben 4 146 146
## # … with 21,681 more rows
## 0 functions:
To sort in descending order wrap the sorting parameter in
desc()
:
arrange_obs(obj2, "otu_counts", desc(M1981P563))
## <Taxmap>
## 704 taxa: aad. Bacteria, aaf. Proteobacteria ... cgz. S24-7, chs. Alcanivoracaceae
## 704 edges: NA->aad, aad->aaf, aad->aag, aad->aah ... amg->cgf, arh->cgz, apt->chs
## 2 data sets:
## otu_counts:
## # A tibble: 21,684 × 294
## taxon_id OTU_ID M1981P563 M1977P…¹ M1980…² M1981…³ M1981…⁴ M1981…⁵ M1981…⁶ M1982…⁷
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beo 3 10251 59 15857 309 134 90 11391 63
## 2 bfb 25 3594 4 3372 31 2 3 2308 0
## 3 bex 13 3083 31 4407 102 26 39 4897 22
## # … with 21,681 more rows, 284 more variables: M1979P447 <dbl>, M1982P701 <dbl>,
## # M1982P661 <dbl>, M1978P293 <dbl>, M1980P462 <dbl>, M1555P221 <dbl>,
## # M1980P471 <dbl>, M1979P422 <dbl>, M1982P655 <dbl>, M1554P182 <dbl>, …, and
## # abbreviated variable names ¹M1977P1709, ²M1980P502, ³M1981P606, ⁴M1981P599,
## # ⁵M1981P611, ⁶M1981P587, ⁷M1982P729
## sample_counts:
## # A tibble: 21,684 × 4
## taxon_id OTU_ID leaf root
## <chr> <chr> <int> <int>
## 1 ben 2 126 119
## 2 beo 3 146 146
## 3 ben 4 146 146
## # … with 21,681 more rows
## 0 functions:
Try to sort the OTU table by the difference between the number of leaf samples and OTU was found in and the number of root samples and OTU was found in. This effectively sorts by how “characteristic” an OTU is to root samples.
10c) What is the ID of the OTU most “characteristic” of roots vs leafs?
10d) mutate_obs
is used to add new
columns to tables in a taxmap
objects. Look at the
documentation and see if you can figure out how to add the results of
the classifications
function as a column in the
“sample_counts” table. Note that you will need to use the “taxon_id”
column of the “sample_counts” table to subset the results of the
classifications
function so that they match the taxon IDs
in the “sample_counts” table.
10e) Now use the %>%
operator to
pipe the code for part d into the code for part
b to sort it and see what taxa are most
“characteristic” of roots vs leafs.
10f) What are the families of the OTUs most “characteristic” of roots vs leafs?