Introduction
This documentation page provides additional information on how to use
the demulticoder R package for processing and analyzing metabarcode
sequencing data. Specifically, it provides more detail on input
directory and file requirements, and key parameters.
Package workflow
- Prepare your input files (metadata.csv, primerinfo_params.csv, unformatted reference databases, and PE Illumina read files).
- Place all input files in a single directory.
- Ensure your file names comply with the specified format.
- Run the four steps/functions of the pipeline with default settings, or adjust parameters as needed.
Data directory structure
Place all your input files into a single directory. The directory should contain the following files:
- PE Illumina read files
- metadata.csv
- primerinfo_params.csv
- Reference databases
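As a quick pre-flight check, the directory layout above can be verified before running the pipeline. The following is a minimal sketch in Python using only the standard library; it is not part of demulticoder, and the function name check_data_dir is hypothetical. It assumes read files use the R1.fastq.gz / R2.fastq.gz suffixes described on this page, and it does not check reference databases because their file names vary.

```python
import tempfile
from pathlib import Path

# The two CSV inputs named in this documentation.
REQUIRED_FILES = ["metadata.csv", "primerinfo_params.csv"]


def check_data_dir(data_dir):
    """Return a list of problems with the input directory (empty if none).

    Hypothetical helper: checks only the CSV inputs and R1/R2 read pairing.
    """
    root = Path(data_dir)
    names = {p.name for p in root.iterdir() if p.is_file()}
    problems = [f"missing {f}" for f in REQUIRED_FILES if f not in names]

    forward = {n for n in names if n.endswith("_R1.fastq.gz")}
    if not forward:
        problems.append("no *_R1.fastq.gz read files found")
    for fwd in sorted(forward):
        # Every forward read file should have a reverse mate.
        mate = fwd[: -len("_R1.fastq.gz")] + "_R2.fastq.gz"
        if mate not in names:
            problems.append(f"{fwd} has no matching {mate}")
    return problems


# Demonstration on a throwaway directory with one unpaired read file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["metadata.csv", "primerinfo_params.csv", "S1_R1.fastq.gz"]:
        (Path(tmp) / name).touch()
    print(check_data_dir(tmp))
```

In practice the function would be called on the real data folder, for example check_data_dir("path/to/data"), where the path is a placeholder.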
Read Name Format
To avoid errors, the only characters that are acceptable in sample
names are letters and numbers. Characters can be separated by
underscores, but no other symbols. The files must end with the suffix
R1.fastq.gz or R2.fastq.gz.
Examples of permissible sample names are as follows:
- Sample1_R1.fastq.gz
- Sample1_R2.fastq.gz
Other permissible names are:
- Sample1_001_R1.fastq.gz
- Sample1_001_R2.fastq.gz
What is not permissible is:
- Sample1_001_R1_001.fastq.gz
- Sample1_001_R2_001.fastq.gz
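The naming rule can be expressed as a regular expression. The sketch below is one possible reading of the rule, not the check demulticoder itself performs: it assumes names consist of alphanumeric chunks joined by single underscores and end in _R1.fastq.gz or _R2.fastq.gz. The function name is_valid_read_name is hypothetical.

```python
import re

# Assumed pattern: letters/numbers, optionally separated by single
# underscores, followed by _R1.fastq.gz or _R2.fastq.gz.
READ_NAME = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*_R[12]\.fastq\.gz$")


def is_valid_read_name(filename):
    """Return True if the file name matches the assumed naming pattern."""
    return READ_NAME.fullmatch(filename) is not None


# The examples from this section: two permissible names, one that is not.
for name in [
    "Sample1_R1.fastq.gz",
    "Sample1_001_R2.fastq.gz",
    "Sample1_001_R1_001.fastq.gz",
]:
    print(name, is_valid_read_name(name))
```

Files that fail the check would need to be renamed before being placed in the data directory.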
Metadata file components
The metadata.csv file contains information about the samples and primers (and associated metabarcodes) used in the experiment. It has the following two required columns:
- sample_name: Identifier for each sample (e.g., S1, S2)
- primer_name: Name of the primer used (applicable options: rps10, its, r16S, other1, other2)
Please add your associated metadata to the file after these two
required columns. This can then be used for your downstream exploratory
or diversity analyses, as the sample data will be incorporated into the
final phyloseq and taxmap objects.
Example file (with optional third column):
sample_name,primer_name,organism
S1,rps10,Cry
S2,rps10,Cin
S1,its,Cry
S2,its,Cin
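A metadata file can be checked against these requirements before it is used. The following Python sketch is illustrative only and is not part of demulticoder; the function name check_metadata is hypothetical. It assumes the two required columns come first (since additional metadata is added after them) and that primer_name must be one of the options listed above.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_name", "primer_name"]
ALLOWED_PRIMERS = {"rps10", "its", "r16S", "other1", "other2"}


def check_metadata(csv_text):
    """Return a list of problems found in metadata CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = []
    # Assumption: the required columns are the first two, in this order.
    if header[:2] != REQUIRED_COLUMNS:
        problems.append(
            f"first two columns must be {REQUIRED_COLUMNS}, got {header[:2]}"
        )
        return problems
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        if row["primer_name"] not in ALLOWED_PRIMERS:
            problems.append(
                f"line {line_no}: unknown primer_name {row['primer_name']!r}"
            )
    return problems


example = """sample_name,primer_name,organism
S1,rps10,Cry
S2,rps10,Cin
S1,its,Cry
S2,its,Cin
"""
print(check_metadata(example))
```

Extra columns such as organism are ignored by the check, matching the idea that any metadata may follow the two required columns.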
Primer and parameter file components
The primerinfo_params.csv file contains information about the primer
sequences used in the experiment, along with optional additional
parameters that are part of the DADA2 pipeline. If anything is not
specified, the default values will be used.
Required columns:
- primer_name: Name of the primer/barcode (e.g., its, rps10)
- forward: Forward primer sequence
- reverse: Reverse primer sequence
Below are the parameters that can be input into the
primerinfo_params.csv file along with the defaults.
Refer to the DADA2 documentation and manual for additional information.
DADA2 filterAndTrim function parameters:
- already_trimmed: Boolean indicating if primers are already trimmed (TRUE/FALSE) (default: FALSE)
- minCutadaptlength: Cutadapt parameter; filters out processed reads that are shorter than the specified length (default: 0)
- multithread: Boolean for multithreading (TRUE/FALSE) (default: FALSE)
- verbose: Boolean for verbose output (TRUE/FALSE) (default: FALSE)
- maxN: Maximum number of N bases allowed (default: 0)
- maxEE_forward: Maximum expected errors for forward reads (default: Inf)
- maxEE_reverse: Maximum expected errors for reverse reads (default: Inf)
- truncLen_forward: Truncation length for forward reads (default: 0)
- truncLen_reverse: Truncation length for reverse reads (default: 0)
- truncQ: Truncation quality threshold (default: 2)
- minLen: Minimum length of reads after processing (default: 20)
- maxLen: Maximum length of reads after processing (default: Inf)
- minQ: Minimum quality score (default: 0)
- trimLeft: Number of bases to trim from the start of reads (default: 0)
- trimRight: Number of bases to trim from the end of reads (default: 0)
- rm.lowcomplex: Boolean for removing low complexity sequences (default: TRUE)
DADA2 learnErrors function parameters:
- nbases: Number of bases to use for error rate learning (default: 1e+08)
- randomize: Randomize reads for error rate learning (default: FALSE)
- MAX_CONSIST: Maximum number of self-consistency iterations (default: 10)
- OMEGA_C: Convergence threshold for the error rates (default: 0)
- qualityType: Quality score type ("Auto", "FastqQuality", or "ShortRead") (default: "Auto")
DADA2 plotErrors function parameters:
- err_out: Return the error rates used for inference (default: TRUE)
- err_in: Use input error rates instead of learning them (default: FALSE)
- nominalQ: Use nominal Q-scores (default: FALSE)
- obs: Return the observed error rates (default: TRUE)
DADA2 dada function parameters:
- OMP: Use OpenMP multi-threading if available (default: TRUE)
- n: Number of reads to use for error rate estimation (default: 1e+05)
- id.sep: Character separating sample ID from sequence name (default: "\\s")
- orient.fwd: NULL or TRUE/FALSE to orient sequences (default: NULL)
- pool: Pool samples for error rate estimation (default: FALSE)
- selfConsist: Perform self-consistency iterations (default: FALSE)
DADA2 mergePairs function parameters:
- minOverlap: Minimum overlap for merging paired-end reads (default: 12)
- maxMismatch: Maximum mismatches allowed in the overlap region (default: 0)
DADA2 removeBimeraDenovo function parameters:
- method: Method for identifying and removing bimeras ("consensus" or "pooled") (default: "consensus")
DADA2 assignTaxonomy function parameters:
- minBoot: The minimum bootstrap confidence for assigning a taxonomic level (default: 0)
- tryRC: If TRUE, the reverse-complement of each sequence will be used for classification if it is a better match to the reference sequences than the forward sequence (default: FALSE)
Other parameters to include in the CSV input file:
- min_asv_length: Minimum length of Amplicon Sequence Variants (ASVs) after core dada ASV inference steps (default: 0)
- seed: For greater reproducibility, the user can specify an integer to set as a seed when the following DADA2 functions are run: plotQualityProfile, learnErrors, dada, makeSequenceTable, and assignTaxonomy (default: NULL)
Example file (with select optional columns after forward and reverse primer sequence columns):
primer_name,forward,reverse,already_trimmed,minCutadaptlength,multithread,verbose,maxN,maxEE_forward,maxEE_reverse,truncLen_forward,truncLen_reverse,truncQ,minLen,maxLen,minQ,trimLeft,trimRight,rm.lowcomplex,minOverlap,maxMismatch,min_asv_length
rps10,GTTGGTTAGAGYARAAGACT,ATRYYTAGAAAGAYTYGAACT,FALSE,100,TRUE,FALSE,1.00E+05,5,5,0,0,5,150,Inf,0,0,0,0,15,0,50
its,CTTGGTCATTTAGAGGAAGTAA,GCTGCGTTCTTCATCGATGC,FALSE,50,TRUE,FALSE,1.00E+05,5,5,0,0,5,50,Inf,0,0,0,0,15,0,50
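To illustrate how unspecified parameters fall back to defaults, the sketch below merges the values in a parameter CSV with a subset of the defaults listed above. It is a Python illustration only and does not reproduce how demulticoder parses the file; the function name effective_params is hypothetical, and treating an empty cell as "not specified" is an assumption. Values are kept as strings, exactly as they appear in the CSV.

```python
import csv
import io

# A subset of the defaults listed on this page; extend as needed.
DEFAULTS = {
    "already_trimmed": "FALSE",
    "minCutadaptlength": "0",
    "maxN": "0",
    "truncQ": "2",
    "minLen": "20",
    "minOverlap": "12",
    "maxMismatch": "0",
    "min_asv_length": "0",
}


def effective_params(csv_text):
    """Map each primer_name to its parameters, with defaults filled in."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        params = dict(DEFAULTS)
        # Assumption: a missing or empty cell means "use the default".
        params.update({k: v for k, v in row.items() if v not in (None, "")})
        result[row["primer_name"]] = params
    return result


# Shortened example: only two optional columns, one cell left empty.
example = """primer_name,forward,reverse,minCutadaptlength,minOverlap
rps10,GTTGGTTAGAGYARAAGACT,ATRYYTAGAAAGAYTYGAACT,100,15
its,CTTGGTCATTTAGAGGAAGTAA,GCTGCGTTCTTCATCGATGC,50,
"""
params = effective_params(example)
print(params["rps10"]["minCutadaptlength"], params["its"]["minOverlap"])
```

Printing the merged parameters for each primer is a simple way to confirm which settings a run would use before starting it.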
Reference Databases
Databases will be copied into the user-specified data folder where the raw data files and CSV files are located. The database file names are passed as parameters to the assignTax function.
For now, the package is compatible with the following databases:
- oomyceteDB from: https://grunwaldlab.github.io/OomyceteDB/
- SILVA 16S database with species assignments: https://www.arb-silva.de/
  - An easily accessible download is found here: https://zenodo.org/records/14169026
- UNITE fungal database from https://zenodo.org/records/14169026
- Up to two other reference databases. The user will need to reformat headers exactly as outlined here. The user can then specify the path to the database in the input file. The database should be in FASTA format.