Glossary

Amplicon

A piece of DNA produced by PCR.

Amplicon sequence variants (ASVs)

Also called Exact Sequence Variants (ESVs). ASVs are the inferred unique sequences present in the original sample, after correcting for sequencing and PCR errors. See the following for reasons to use ASVs instead of OTUs:

Callahan, Benjamin J., Paul J. McMurdie, and Susan P. Holmes. “Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.” The ISME journal 11.12 (2017): 2639.

Analysis of variance (ANOVA)

A statistical technique to determine if at least one of two or more sample means is different from the others. If the result is significant, it still does not tell you which of the means were different, just that at least one was different. To see which of the means are different from the others, you can use Tukey’s Honest Significant Difference (HSD) test.
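As a rough sketch of how this might look in R, the example below runs an ANOVA followed by Tukey’s HSD using the aov and TukeyHSD functions that come with R. The PlantGrowth data set (also included with R) is used purely for illustration and is an assumption of this example.

```r
# PlantGrowth is a built-in data set: plant weights for 3 groups (ctrl, trt1, trt2)
fit <- aov(weight ~ group, data = PlantGrowth)

# The ANOVA table: tests whether at least one group mean differs from the others
summary(fit)

# Pairwise comparisons to see which group means differ from each other
tukey <- TukeyHSD(fit)
tukey
```

Each row of the TukeyHSD output corresponds to one pair of groups, along with an adjusted p-value for that comparison.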

Base R

This is the term used for all of the R functions that are loaded by default when starting R, without installing or loading any packages explicitly.

Capture groups

A way of specifying parts of a regular expression that are of interest, often for the purpose of extracting the values that those portions match. They are specified by parentheses and do not change what the regular expression matches. For example, extracting the capture group in “John ([A-Za-z]+)” would return the last name of people with the first name “John”.
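One possible way to extract a capture group in base R is sketched below, using regexec and regmatches. The names being searched are made up for this example, and the pattern allows both upper and lower case letters so that capitalized last names are matched.

```r
# A hypothetical set of names to search
people <- c("John Smith", "Jane Doe", "John Baker")

# regexec() records the position of the whole match and of each capture group
pattern <- "John ([A-Za-z]+)"
matches <- regmatches(people, regexec(pattern, people))

# For each string that matched, element 1 is the whole match and element 2 is
# the first capture group; strings that do not match give an empty vector
last_names <- vapply(
  matches,
  function(m) if (length(m) > 0) m[2] else NA_character_,
  character(1)
)
last_names
## [1] "Smith" NA      "Baker"
```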

Chimeric sequences

Sequences composed of two or more pieces of unrelated DNA caused by “errors” during PCR when an incomplete amplicon acts as a primer for a different template in a subsequent cycle.

Class

A class is a defined set of variables along with a set of functions designed to work with those variables. The specifics of how classes are structured vary greatly between programming languages, but the concepts are similar. For example, you might have a class called “Dog” that contains the dog’s age (number), the dog’s breed (text), and the name of the dog’s owner (text). With those variables, the “Dog” class might have functions that make the dog a year older or change the owner of the dog, etc.
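R has more than one way to define classes. The sketch below uses the simple “S3” approach to illustrate the “Dog” idea; the function names new_dog, birthday, and change_owner are made up for this example.

```r
# Constructor: creates an object of the (hypothetical) class "Dog"
new_dog <- function(age, breed, owner) {
  structure(list(age = age, breed = breed, owner = owner), class = "Dog")
}

# Functions designed to work with "Dog" objects
birthday <- function(dog) {
  dog$age <- dog$age + 1  # make the dog a year older
  dog
}

change_owner <- function(dog, new_owner) {
  dog$owner <- new_owner
  dog
}

# "fido" is an object (an instance) of the class "Dog"
fido <- new_dog(age = 3, breed = "beagle", owner = "Ann")
fido <- birthday(fido)
fido$age
## [1] 4
class(fido)
## [1] "Dog"
```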

Comma-delimited text file

Also known as comma-separated value (CSV) format. A plain text file used to store tables, usually with the file extension .csv. Each row is one line and columns are separated by commas (i.e. ,).
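As a small illustration, the sketch below stores the contents of a made-up CSV file in a string and reads it with the read.csv function. Normally you would give read.csv the path to a file instead of using the text argument.

```r
# The contents of a small, made-up CSV file, written here as a string
csv_text <- "name,age,breed
fido,3,beagle
scraps,5,poodle"

# read.csv() converts the comma-separated text into a data.frame
dogs <- read.csv(text = csv_text)
dogs
##     name age  breed
## 1   fido   3 beagle
## 2 scraps   5 poodle
```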

Compositional data

When counts have a fixed total regardless of the actual abundance of the things counted. All microbiome sequencing data is compositional because we sequence some number of reads regardless of how many PCR amplicons or template DNA molecules there were; i.e. we don’t get more reads from samples with more DNA. This means the number of reads for a given organism does not reflect its absolute abundance, but rather its abundance relative to other organisms in the community (assuming no other biases). It’s important to keep this in mind because many common statistical techniques assume independence of observations and read counts are not independent. For example, if you have 10 reads of an organism in one sample and 5 reads in another, it could be that the organism is equally abundant in both, but the second community just has a lot more of other species as well.
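The 10 reads versus 5 reads example can be sketched with some made-up numbers. Here, the absolute abundances and the number of reads per sample are assumptions chosen for illustration, and reads are assumed to be exactly proportional to abundance.

```r
# Hypothetical absolute abundances of two organisms in two samples.
# Organism A is equally abundant in both samples.
sample_1 <- c(A = 100, B = 100)
sample_2 <- c(A = 100, B = 300)

# Sequencing returns a fixed number of reads regardless of total abundance
reads_per_sample <- 20
reads_1 <- reads_per_sample * sample_1 / sum(sample_1)
reads_2 <- reads_per_sample * sample_2 / sum(sample_2)

reads_1
##  A  B 
## 10 10 
reads_2
##  A  B 
##  5 15 
```

Organism A has half as many reads in the second sample even though its absolute abundance did not change.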

Doubletons

A sequence that only appears twice in a sample or in all samples, depending on the exact definition being used.

FASTQ

A file format used to store DNA sequences with associated per-base quality scores, often made by DNA sequencers. The format is similar to FASTA, but with an extra few lines per sequence. A FASTQ file might look like this:

@SEQ_ID1 other info...
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID2 other info...
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Every sequence has four lines:

  1. Like a FASTA file, the first line for each sequence is a header. It starts with @ and can contain anything.
  2. The DNA sequence.
  3. A separator line that always starts with a +.
  4. The per-base quality scores, encoded as one ASCII character per base. Each character corresponds to a number.
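To give a sense of how quality characters correspond to numbers, the sketch below converts the first few characters of the quality line above. It assumes the commonly used encoding in which 33 is subtracted from each character’s ASCII code; other encodings exist, so check which one your files use.

```r
# The first few quality characters from the example above
quality <- "!''*(("

# Convert each character to its ASCII code, then subtract the assumed offset of 33
scores <- utf8ToInt(quality) - 33
scores
## [1] 0 6 6 9 7 7
```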

Function

Any command or operation that does something in a programming language is a function. Functions often have inputs that influence what the output is, but some don’t have inputs. Functions will usually return some type of output, but they might not, or they might have an effect besides what they return (this is rare in R, but common in other programming languages). The concept of functions, like variables, comes from math. For example, the equation for a line is y = mx + b. In R, you could make a function to return y given the values of m, x, and b, like so:

line <- function(m, x, b) {
  return(m * x + b)
}

And find the value for y, for a given set of inputs like so:

line(m = 2.5, x = 3, b = -1)
## [1] 6.5

Hexadecimal color codes

A way of encoding colors using 6 numbers or letters. Colors on a computer are made by varying the intensity of red, green, and blue. The intensity of each color is encoded from 0-255 (1 byte) and converted to a base 16 numbering system that uses the numbers 0-9 and the letters A-F, so it only takes 2 digits to encode 256 values. Three pairs of two digits corresponding to the intensity of red, green, and blue make up a hexadecimal color code. For example, “#FF0000” is the most intense red and corresponds to red = 255, green = 0, and blue = 0. Black is “#000000” and white is “#FFFFFF”.
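The conversion between intensities and color codes can be tried out in R. This short sketch uses the strtoi function from base R and the rgb and col2rgb functions from the grDevices package that is loaded with R by default.

```r
# Convert a pair of hexadecimal digits to an ordinary (base 10) number
strtoi("FF", base = 16L)
## [1] 255

# Build a color code from red, green, and blue intensities (0-255)
rgb(255, 0, 0, maxColorValue = 255)
## [1] "#FF0000"

# Split a color code back into its red, green, and blue intensities
col2rgb("#FF0000")[, 1]
##   red green  blue 
##   255     0     0 
```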

Inverse Simpson index

An alternate way of expressing the Simpson index that gives values of 1 or greater. It is the number of species that a theoretical community in which all species are equally abundant would need in order to have the same Simpson index value as the community being analyzed.
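A minimal sketch of the calculation is shown below, assuming the form of the Simpson index that sums the squared proportions of each species (other forms of the index are also in use). The counts are made up for illustration.

```r
# Hypothetical counts for three species in one community
counts <- c(50, 30, 20)

# Proportion of the community made up by each species
p <- counts / sum(counts)

# Simpson index (sum of squared proportions): the chance that two randomly
# chosen individuals belong to the same species
simpson <- sum(p^2)

# Inverse Simpson index
inv_simpson <- 1 / simpson
inv_simpson
## [1] 2.631579

# A perfectly even community of 4 species gives a value of 4
even <- rep(25, 4) / 100
1 / sum(even^2)
## [1] 4
```

So the three-species community above is about as diverse as a community of 2.6 equally abundant species.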

List

A common R data type used to hold an ordered set of other R data of any type. Unlike a vector, a single list can have data of multiple types. For example, you can make a list of vectors of different types:

list(1:3, "bob", c(TRUE, FALSE))
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "bob"
## 
## [[3]]
## [1]  TRUE FALSE

Multiple comparison corrections

The p-value in a statistical test is the probability that a result at least as extreme as the one observed would occur by chance if there were no real effect (making lots of assumptions about the distribution of the test statistic). If many tests are run on subsets of the data, then the chance that at least one has a “significant” p-value by chance goes up and the p-value for each test can no longer be interpreted on its own. There is a set of techniques to correct these p-values called multiple comparison corrections. Common ones include the false discovery rate (FDR) and Bonferroni corrections.
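In R, one way to apply these corrections is the p.adjust function from the stats package, which is loaded by default. The p-values below are made up for illustration.

```r
# Hypothetical p-values from five separate tests
p <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
bonferroni <- p.adjust(p, method = "bonferroni")
bonferroni
## [1] 0.05 0.10 0.15 0.20 0.25

# False discovery rate (Benjamini-Hochberg method)
fdr <- p.adjust(p, method = "fdr")
fdr
## [1] 0.05 0.05 0.05 0.05 0.05
```

The Bonferroni correction is the more conservative of the two, which is why its adjusted values are larger here.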

Non-standard evaluation (NSE)

This is a rather advanced programming technique that allows for code in a function call to be interpreted by R in a different way from how it would be interpreted outside that function call. It is used by many R functions to make them easier to read and reduce typing. For example, the library function uses NSE to allow users to leave off the quotes when naming R packages to import. You can call the library function this way:

library("metacoder")

However, using NSE, you can call it this way:

library(metacoder)
## This is metacoder verison 0.3.5 (stable)

Even though the variable metacoder does not exist outside the function call:

print("metacoder")
## [1] "metacoder"
print(metacoder)
## Error in print(metacoder): object 'metacoder' not found

This is used extensively in many newer R packages like dplyr and taxa.
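To give a rough idea of how a function can do this, the sketch below uses the base R functions substitute and deparse to capture the code typed for an argument instead of evaluating it. The function name name_of is made up for this example, and this is not necessarily how library itself is written.

```r
# A hypothetical function that returns the code typed for its argument
# as text, instead of evaluating that code
name_of <- function(x) {
  deparse(substitute(x))
}

# This works even though no variable called metacoder needs to exist
name_of(metacoder)
## [1] "metacoder"
```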

Object

An instance of a class. In other words, a piece of data with a defined type and functions designed to operate on it. For example, if you had a class for “Dog”, you might have an object of that class stored in a variable called “fido” and another called “scraps”.

Operational Taxonomic Units (OTUs)

OTUs are clusters of similar sequences often intended to correspond to some taxonomic rank, such as species. They are used to estimate diversity and account for sequencing error. Different barcodes (i.e. genes/loci) evolve at different rates, so how similar sequences must be to be grouped together will vary depending on the taxonomic group, the barcode used, and what taxonomic rank OTUs are intended to correspond to (if any). There are two types of OTUs: de novo and “closed reference”. De novo OTUs are constructed based on pairwise distances between sequences and do not rely on reference databases. Closed reference OTUs (aka phylotypes) are made by comparing sequences to reference databases and clustering based on distance from the most similar reference sequence. For the limitations of OTUs and alternative approaches, see:

Callahan, Benjamin J., Paul J. McMurdie, and Susan P. Holmes. “Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.” The ISME journal 11.12 (2017): 2639.

Parsing

The word “parsing” is used in different ways, but in the context of data science, it means to transform data from one form to another. For example, to read a text file into R and store it as a data.frame would be to parse that file since the data is changing forms.

Phylotypes

There seems to be some variation in how the term “phylotype” is used, but here we will use the following definition. Phylotypes are groupings of sequences based on their similarity to a reference sequence. These differ from de novo OTUs due to their reliance on a similar reference sequence. They are the same as “closed reference” OTUs.

Pipelines

A term used for a series of programs (often automated) used to process data where each program takes the output of the one before as input. The term is generally used when the individual programs are useful on their own for a specific purpose, rather than for small all-purpose tools like R functions.

Plain text

Plain text is the term used to describe text that does not have fonts, images, or other non-text things, like the text edited with Notepad or TextEdit. When you write R code, you are writing plain text. All programming languages use plain text because it is simple and has very few dependencies. Some programs for editing plain text might highlight relevant patterns in different colors, but these colors are specific to the program used to view the text, not the text itself.

R console

This is the text prompt you use to interact with R. When R is started, the R console will look something like this:

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 

R package

An R package is a set of user-defined functions organized so that people can easily share and use them. Most of the functions used by most R users are from R packages rather than those supplied by base R. R packages can be installed in a few ways, but the most common is to download them from The Comprehensive R Archive Network (CRAN) using the install.packages function. For example, stringr is an R package that supplies functions to work with text:

install.packages("stringr")

Once installed, a package must be “loaded” using the library function before any functions it supplies can be used:

library("stringr")

Now we can use functions from the stringr package.

R project

“R projects” are an RStudio concept and are integral to many people’s workflow and organization. An R project is just a folder that has a file in it ending in “.Rproj”. This file is created by RStudio when you create a new project. Although R projects are not needed, we highly recommend using them for the following reasons:

  • They help with organization, since they encourage you to put all the code and data for a project in a single directory.
  • They help standardize your current working directory. Each time a project is opened, your current working directory is automatically switched to the project directory.
  • They store where you left off when you last closed RStudio. Depending on how you set things up, it’s as if you never closed RStudio at all: all the variables and files, even unsaved files, will be where you last left them. NOTE: It is recommended that you do not rely on restoring variables between sessions, although RStudio will offer to do so.

Whenever you start something in R that you want to save, we recommend using an R project. You can make a new project by clicking on the upper right drop-down menu or “File > New Project”.

Random number generator seeds

Random number generators are used by computers to simulate randomness, but are not actually random. They work by taking a starting number and running that number through a function that returns another number, which is then run through the same function to produce another number and so on. The first number is the “seed” and a given seed will always produce the same series of random numbers. Random number generators appear random because the seed that is chosen when the generator is created is usually something like the milliseconds on your computer’s internal clock. You can, however, set the seed yourself using the set.seed function if you always want the same “random” behavior. For example:

rnorm(3) # produces 3 "random" numbers from a normal distribution
## [1] -1.81768698 -0.05350538 -0.93724088
rnorm(3)
## [1] -1.2963560 -1.6398278 -0.3956159
set.seed(1)
rnorm(3)
## [1] -0.6264538  0.1836433 -0.8356286
set.seed(1)
rnorm(3)
## [1] -0.6264538  0.1836433 -0.8356286

Rarefaction

Randomly subsamples the counts of types (e.g. OTUs or species) in a sample down to some total number of counts. For example, a sample with 4 counts of A and 2 counts of B, rarefied to a total of 3, would (on average) become 2 counts of A and 1 count of B. This is used to transform read counts to simulate equal sampling depth, since different samples usually have different numbers of reads, due to unavoidable inconsistencies in high-throughput sequencing. This is important when evaluating the relative diversity among a set of samples, since higher numbers of reads mean rare species are more likely to be observed.
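The example of 4 counts of A and 2 counts of B can be sketched in base R using the sample function. This is only meant to illustrate the idea; dedicated functions in packages designed for community data are normally used in practice. The seed is arbitrary and only there so the result is repeatable.

```r
# A sample with 4 counts of A and 2 counts of B
counts <- c(A = 4, B = 2)

# Write the sample out as one entry per count: A A A A B B
individuals <- factor(rep(names(counts), times = counts), levels = names(counts))

# Randomly pick 3 of the 6 counts without replacement
set.seed(1)
rarefied <- table(sample(individuals, size = 3))
rarefied
```

Any single run can differ from the average of 2 counts of A and 1 count of B, since the subsampling is random.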

Regular expressions

Also known as a “regex”. It is a kind of computer language used to specify and search for patterns in plain text. It is widely used within other languages like R, Python, and Perl. Most regular expressions are composed of a series of “what to match” followed by “how many times to match”. For example, “John [a-z]+” would match any instance of the word “John” followed by a space and one or more lower case letters. Regular expressions can be very complicated, but are also very useful.
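As a quick illustration, the pattern from the example above can be tried in R with the base R functions grepl, regexpr, and regmatches. The text being searched is made up for this example.

```r
# Some made-up text to search
x <- c("John smith", "Jane doe", "John 42")

# Which elements contain a match?
has_match <- grepl("John [a-z]+", x)
has_match
## [1]  TRUE FALSE FALSE

# Extract the text that matched
matched <- regmatches(x, regexpr("John [a-z]+", x))
matched
## [1] "John smith"
```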

RStudio

A program used to write and organize projects with R code. It can run R, but is not R itself, and relies on a separate installation of R to work.

Singleton

A sequence that only appears once in a sample or in all samples, depending on the exact definition being used.

Subtaxa

The taxa contained within another taxon. For example, Homo sapiens is a subtaxon of the genus Homo.

Supertaxa

The taxa a taxon is contained within. For example, Homo is a supertaxon of the species Homo sapiens.

Tab-delimited text file

Also known as tab-separated value (TSV) format. A plain text file used to store tables, usually with the file extension .tsv. Each row is one line and columns are separated by tabs (i.e. \t).

Taxonomic classifications

The set of nested taxa an organism belongs to. For example, the taxonomic classification of Homo sapiens is:

Animalia > Chordata > Mammalia > Primates > Haplorhini > Simiiformes > Hominidae > Homininae > Hominini > Homo > H. sapiens

Taxonomic ranks

A rank is the level at which a taxon appears in a nested hierarchy of taxa. Common ranks include species, genus, family, order, class, phylum, and domain, although there might be others.

Taxonomy

In general, taxonomy is the study of classifying things into groups. Typically these groups are nested inside each other (i.e. a hierarchy). For our purpose, “taxonomy” refers to how we organize lifeforms into phyla, families, genera, species, etc. See the entry on “Taxonomic classifications” for more information.

The Comprehensive R Archive Network (CRAN)

A volunteer-run organization that hosts R packages and enforces standards for how they should be structured. When you install an R package using install.packages, you are usually installing from CRAN. CRAN is one of the major reasons R packages are so easy to install.

Tibble

An enhanced data.frame with better appearance when printed to the console and more consistent behavior. Tibbles do not allow for row names, since their designer, Hadley Wickham, thinks all data should be treated the same and row names are a kind of “special” case. Here is an example of a tibble compared to a data.frame:

# data.frame
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# tibble
tibble::as_tibble(mtcars)
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

Transpose

To “turn” a table or matrix 90 degrees, making the rows into columns and the columns into rows.
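In R, this can be done with the t function. A small example:

```r
# A matrix with 2 rows and 3 columns
m <- matrix(1:6, nrow = 2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Transposed: 3 rows and 2 columns
t(m)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
```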

Tukey’s Honest Significant Difference (HSD)

A test that can be used after an ANOVA to tell which of a set of sample means are different from each other by performing pairwise comparisons.

Variable

In programming, a variable is a name associated with a value that can change or “vary”. This is similar to how the word is used in math. For example, the equation for a line is y = mx + b. In this equation, all of the letters are variables that can represent any number.

Vector

An ordered set of data of the same type. This is one of the most common types of data used in R. Any number or piece of text in R is a vector. For example, typing 5 produces a numeric vector of length 1:

5
## [1] 5

And typing 1:10 produces a numeric vector of length 10:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

Vectors can also be other types like character:

c("hello", "world")
## [1] "hello" "world"

Vignette

A short tutorial-like document included in R packages to help new users get familiar with the package. These can be found online or accessed in the installed package using the browseVignettes function. For example, you can see vegan’s vignettes by typing the following into an R console:

browseVignettes(package = "vegan")

Wilcoxon Rank Sum test

A non-parametric test (i.e. it does not rely on a normal distribution) that tests whether a randomly selected value from one population tends to be greater than a randomly selected value from another. It can be thought of as the equivalent of a t-test, but it only takes into account whether a value is greater than another value, rather than how much greater it is.
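In R, this test can be run with the wilcox.test function from the stats package, which is loaded by default. The measurements below are made up for illustration; every value in the first group is smaller than every value in the second.

```r
# Hypothetical measurements from two groups
x <- c(1.1, 2.3, 3.2, 4.8, 5.5)
y <- c(6.1, 7.4, 8.2, 9.9, 10.3)

# Test whether values from one group tend to be greater than the other
result <- wilcox.test(x, y)
result$p.value
```

Because only the ordering of the values matters, replacing 10.3 with 1000 would give the same result.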