Introduction
This documentation describes how to use the Demulticoder R package to process and analyze metabarcode sequencing data. It covers input file requirements, file naming conventions, and parameter settings.
Quick Start Guide
- Prepare your input files (metadata.csv, primerinfo_params.csv, unformatted reference databases, and paired-end (PE) Illumina read files).
- Place all input files in a single directory.
- Ensure your file names comply with the specified format.
- Run the pipeline with default settings or adjust parameters as needed.
Input Files
Directory Structure
Place all your input files into a single directory. The directory should contain the following files:
- metadata.csv
- primerinfo_params.csv
- PE Illumina read files
- Unformatted reference databases
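As a quick sanity check before running the pipeline, the directory contents can be inspected with base R. The sketch below is not part of the package: the directory name and the empty placeholder files are made up for illustration, and the pairing check assumes that forward and reverse files differ only in the _R1/_R2 part of the name.

```r
# Build a throwaway example directory (placeholder files, for illustration only)
data_dir <- file.path(tempdir(), "demulticoder_input_example")
dir.create(data_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  data_dir,
  c("metadata.csv", "primerinfo_params.csv",
    "Sample1_R1.fastq.gz", "Sample1_R2.fastq.gz")
)))

present <- list.files(data_dir)

# The two CSV files named in this documentation must be present
required_files <- c("metadata.csv", "primerinfo_params.csv")
missing_files <- setdiff(required_files, present)

# Every forward read file should have a matching reverse read file, and vice versa
r1_samples <- sub("_R1\\.fastq\\.gz$", "", grep("_R1\\.fastq\\.gz$", present, value = TRUE))
r2_samples <- sub("_R2\\.fastq\\.gz$", "", grep("_R2\\.fastq\\.gz$", present, value = TRUE))
unpaired <- union(setdiff(r1_samples, r2_samples), setdiff(r2_samples, r1_samples))

if (length(missing_files) > 0) {
  stop("Missing input files: ", paste(missing_files, collapse = ", "))
}
if (length(unpaired) > 0) {
  stop("Samples without both R1 and R2 files: ", paste(unpaired, collapse = ", "))
}
```

To check a real project, point data_dir at your own input directory and skip the lines that create the placeholder files.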
File Naming Conventions
Read Name Format
To avoid errors, sample names may contain only letters and numbers, optionally separated by underscores; no other symbols are allowed. Each read file name must end with the suffix R1.fastq.gz or R2.fastq.gz.
Examples of permissible file names are as follows:
- Sample1_R1.fastq.gz
- Sample1_R2.fastq.gz
Other permissible names are:
- Sample1_001_R1.fastq.gz
- Sample1_001_R2.fastq.gz
The following names are not permissible, because they do not end with the suffix R1.fastq.gz or R2.fastq.gz:
- Sample1_001_R1_001.fastq.gz
- Sample1_001_R2_001.fastq.gz
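The naming rules above can be approximated with a regular expression. This is a minimal sketch in base R, not the check the package itself performs; the helper name is_valid_read_name is hypothetical, and the pattern assumes that an underscore always precedes R1 or R2.

```r
# Hypothetical helper: TRUE for file names that follow the convention above
is_valid_read_name <- function(file_names) {
  # Groups of letters/digits joined by single underscores,
  # ending in _R1.fastq.gz or _R2.fastq.gz
  grepl("^[A-Za-z0-9]+(_[A-Za-z0-9]+)*_R[12]\\.fastq\\.gz$", file_names)
}

candidates <- c(
  "Sample1_R1.fastq.gz",
  "Sample1_001_R2.fastq.gz",
  "Sample1_001_R1_001.fastq.gz",  # text after the read designation
  "Sample-1_R1.fastq.gz"          # hyphen is not an allowed symbol
)

result <- is_valid_read_name(candidates)
names(result) <- candidates
print(result)
```

The first two names pass and the last two fail, which matches the permissible and non-permissible examples listed above.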
metadata.csv
The metadata.csv file contains information about the samples and primers (and associated barcodes) used in the experiment. It has the following two required columns:
- sample_name: Identifier for each sample (e.g., S1, S2)
- primer_name: Name of the primer used (e.g., rps10, its, r16S, other1, other2)
Any additional metadata can be added in columns after these two required columns. The sample data are incorporated into the final phyloseq and taxmap objects, so these columns remain available for downstream exploratory or diversity analyses.
Example file (with optional third column):
sample_name,primer_name,organism
S1,rps10,Cry
S2,rps10,Cin
S1,its,Cry
S2,its,Cin
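Before running the pipeline, it can help to confirm that the file parses as expected. The following base R sketch writes the example above to a temporary file and checks the two required columns. It is an illustration only, not the package's own validation; the character check on sample names reuses the naming rule from the File Naming Conventions section.

```r
# Write the example metadata to a temporary file so the sketch is self-contained
metadata_path <- file.path(tempdir(), "metadata.csv")
writeLines(c(
  "sample_name,primer_name,organism",
  "S1,rps10,Cry",
  "S2,rps10,Cin",
  "S1,its,Cry",
  "S2,its,Cin"
), metadata_path)

metadata <- read.csv(metadata_path, stringsAsFactors = FALSE)

# The two required columns must be present
required_cols <- c("sample_name", "primer_name")
missing_cols <- setdiff(required_cols, colnames(metadata))
if (length(missing_cols) > 0) {
  stop("metadata.csv is missing: ", paste(missing_cols, collapse = ", "))
}

# Sample names should contain only letters, numbers, and underscores
valid_names <- grepl("^[A-Za-z0-9_]+$", metadata$sample_name)

# Number of samples listed for each primer
samples_per_primer <- table(metadata$primer_name)
print(samples_per_primer)
```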
primerinfo_params.csv
The primerinfo_params.csv file contains the primer sequences used in the experiment, along with optional parameters for the DADA2 pipeline. Any optional parameter that is not specified takes its default value.
Required columns:
- primer_name: Name of the primer/barcode (e.g., its, rps10)
- forward: Forward primer sequence
- reverse: Reverse primer sequence
Cutadapt and DADA2 filterAndTrim function parameters:
- already_trimmed: Boolean indicating if primers are already trimmed (TRUE/FALSE) (default: FALSE)
- minCutadaptlength: Cutadapt parameter; minimum read length after Cutadapt trimming (default: 0)
- multithread: Boolean for multithreading (TRUE/FALSE) (default: FALSE)
- verbose: Boolean for verbose output (TRUE/FALSE) (default: FALSE)
- maxN: Maximum number of N bases allowed (default: 0)
- maxEE_forward: Maximum expected errors for forward reads (default: Inf)
- maxEE_reverse: Maximum expected errors for reverse reads (default: Inf)
- truncLen_forward: Truncation length for forward reads (default: 0)
- truncLen_reverse: Truncation length for reverse reads (default: 0)
- truncQ: Truncation quality threshold (default: 2)
- minLen: Minimum length of reads after processing (default: 20)
- maxLen: Maximum length of reads after processing (default: Inf)
- minQ: Minimum quality score (default: 0)
- trimLeft: Number of bases to trim from the start of reads (default: 0)
- trimRight: Number of bases to trim from the end of reads (default: 0)
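- rm.lowcomplex: Low-complexity filter threshold; in DADA2's filterAndTrim this is a numeric value, and reads with an effective number of k-mers below it are removed (default: 0, which disables the filter)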
DADA2 learnErrors function parameters:
- nbases: Number of bases to use for error rate learning (default: 1e+08)
- randomize: Randomize reads for error rate learning (default: FALSE)
- MAX_CONSIST: Maximum number of self-consistency iterations (default: 10)
- OMEGA_C: Convergence threshold for the error rates (default: 0)
- qualityType: Quality score type (“Auto”, “FastqQuality”, or “ShortRead”) (default: “Auto”)
DADA2 plotErrors function parameters:
- err_out: Plot the output error rates (default: TRUE)
- err_in: Plot the input error rates (default: FALSE)
- nominalQ: Plot the error rates expected if quality scores matched their nominal definition (default: FALSE)
- obs: Plot the observed error rates as points (default: TRUE)
DADA2 dada function parameters:
- OMP: Use OpenMP multi-threading if available (default: TRUE)
- n: Number of reads to use for error rate estimation (default: 1e+05)
- id.sep: Character separating sample ID from sequence name (default: “\s”)
- orient.fwd: NULL or TRUE/FALSE to orient sequences (default: NULL)
- pool: Pool samples together for sample inference (default: FALSE)
- selfConsist: Perform self-consistency iterations (default: FALSE)
DADA2 mergePairs function parameters:
- minOverlap: Minimum overlap for merging paired-end reads (default: 12)
- maxMismatch: Maximum mismatches allowed in the overlap region (default: 0)
DADA2 removeBimeraDenovo function parameters:
- method: Method for chimera (bimera) identification (“consensus” or “pooled”) (default: “consensus”)
Other parameters to include in CSV input file:
- min_asv_length: Minimum length of Amplicon Sequence Variants (ASVs) retained after the core DADA2 ASV inference steps (default: 0)
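To illustrate how unspecified parameters fall back to defaults, here is a minimal base R sketch. It does not reproduce the package's internal logic; the defaults list copies only a handful of the values documented above, and the user_params table is a made-up example.

```r
# A few of the defaults listed above (illustrative subset, not the full list)
defaults <- list(maxN = 0, truncQ = 2, minLen = 20, minOverlap = 12, maxMismatch = 0)

# Hypothetical user input: only minLen and minOverlap are specified
user_params <- data.frame(
  primer_name = c("rps10", "its"),
  minLen = c(150, 50),
  minOverlap = c(15, 15),
  stringsAsFactors = FALSE
)

# Add a column holding the default for every parameter the user left out
resolved <- user_params
for (param in setdiff(names(defaults), colnames(resolved))) {
  resolved[[param]] <- defaults[[param]]
}

print(resolved)
```

User-supplied values (minLen, minOverlap) are kept as given, while maxN, truncQ, and maxMismatch are filled in from the defaults.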
Example file (with select optional columns after forward and reverse primer sequence columns):
primer_name,forward,reverse,already_trimmed,minCutadaptlength,multithread,verbose,maxN,maxEE_forward,maxEE_reverse,truncLen_forward,truncLen_reverse,truncQ,minLen,maxLen,minQ,trimLeft,trimRight,rm.lowcomplex,minOverlap,maxMismatch,min_asv_length
rps10,GTTGGTTAGAGYARAAGACT,ATRYYTAGAAAGAYTYGAACT,FALSE,100,TRUE,FALSE,1.00E+05,5,5,0,0,5,150,Inf,0,0,0,0,15,0,50
its,CTTGGTCATTTAGAGGAAGTAA,GCTGCGTTCTTCATCGATGC,FALSE,50,TRUE,FALSE,1.00E+05,5,5,0,0,5,50,Inf,0,0,0,0,15,0,50
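A file like the one above can be checked with base R before use. This sketch is an illustration rather than the package's own validation: it uses a shortened copy of the example (required columns plus two optional ones) and assumes that primer sequences are written with uppercase IUPAC nucleotide codes.

```r
# Shortened copy of the example file, written to a temporary location
primer_path <- file.path(tempdir(), "primerinfo_params.csv")
writeLines(c(
  "primer_name,forward,reverse,already_trimmed,minCutadaptlength",
  "rps10,GTTGGTTAGAGYARAAGACT,ATRYYTAGAAAGAYTYGAACT,FALSE,100",
  "its,CTTGGTCATTTAGAGGAAGTAA,GCTGCGTTCTTCATCGATGC,FALSE,50"
), primer_path)

primers <- read.csv(primer_path, stringsAsFactors = FALSE)

# The three required columns must be present
required_cols <- c("primer_name", "forward", "reverse")
missing_cols <- setdiff(required_cols, colnames(primers))
if (length(missing_cols) > 0) {
  stop("primerinfo_params.csv is missing: ", paste(missing_cols, collapse = ", "))
}

# Assumption: primers use uppercase IUPAC nucleotide codes only
iupac_pattern <- "^[ACGTRYSWKMBDHVN]+$"
valid_forward <- grepl(iupac_pattern, primers$forward)
valid_reverse <- grepl(iupac_pattern, primers$reverse)

if (!all(valid_forward & valid_reverse)) {
  stop("Unexpected characters in primers: ",
       paste(primers$primer_name[!(valid_forward & valid_reverse)], collapse = ", "))
}
```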
Reference Database
Reference databases are copied into the user-specified data folder that holds the raw data files and CSV files. The database file names are supplied as parameters to the assignTax function.
For now, the package is compatible with the following databases:
- oomycetedb from http://www.oomycetedb.org/
- SILVA 16S database with species assignments: https://zenodo.org/records/4587955/files/silva_nr99_v138.1_wSpecies_train_set.fa.gz?download=1
- UNITE fungal database from https://unite.ut.ee/repository.php
- Up to two other reference databases in FASTA format. The user must reformat the headers exactly as outlined in the DADA2 database format (similar to the SILVA database format) and can then specify the path to the database in the input file.
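For a custom database, a rough header check can catch formatting problems early. In the DADA2 training format, each FASTA header lists taxonomic levels separated by semicolons. The base R sketch below flags headers that do not follow that pattern. The file name, the placeholder sequences, and the "at least two levels" threshold are assumptions made for illustration.

```r
# Small made-up FASTA file: one formatted header, one unformatted header
fasta_path <- file.path(tempdir(), "custom_db.fa")
writeLines(c(
  ">Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;",
  "ACGTACGTACGT",   # placeholder sequence
  ">seq2 some free-text description",
  "TTGACCGTA"       # placeholder sequence
), fasta_path)

fasta_lines <- readLines(fasta_path)

# Keep header lines only and drop the leading ">"
headers <- sub("^>", "", fasta_lines[startsWith(fasta_lines, ">")])

# Count semicolon-delimited taxonomic levels in each header
n_levels <- lengths(strsplit(headers, ";", fixed = TRUE))

# Assumption: a usable header carries at least two levels
looks_formatted <- n_levels >= 2

print(data.frame(header = headers, n_levels = n_levels, looks_formatted = looks_formatted))
```

Headers flagged as not formatted would need to be rewritten before the database is used.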