NOTE: THIS PROJECT IS UNDER DEVELOPMENT AND MAY NOT FUNCTION AS EXPECTED UNTIL THIS MESSAGE GOES AWAY

Introduction

nf-core/pathogensurveillance is a population genomic pipeline for pathogen diagnosis, variant detection, and biosurveillance. The pipeline accepts the paths to raw reads for one or more organisms (in the form of a CSV file) and produces interactive HTML reports or PDF documents. Significant features include the ability to analyze unidentified eukaryotic and prokaryotic samples, creation of reports for multiple user-defined groupings of samples, automated discovery and downloading of reference assemblies from NCBI RefSeq, and rapid initial identification based on k-mer sketches followed by more robust core genome and SNP-based phylogenies.

Background - why use pathogensurveillance?

TL;DR: unknown FASTQ -> sample ID + phylogeny + publication-quality figures

Most existing genomic tools are designed to be used with a reference genome. Unfortunately, this is a luxury in the pathogen diagnostic world, where researchers are often faced with unknown samples. This creates a conundrum: how can you make a comparison without a starting point?

Pathogensurveillance addresses this by selecting a suitable reference genome for you through k-mer sketching. In most cases, this will be the best available reference. It is therefore a good option for identifying an unknown sample, or when you would like empirical data showing which references are available for closely related species. Conversely, if you already have a good understanding of your system, you can specify which reference genome should be used.
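
To make the idea of sketch-based reference selection concrete, here is a minimal toy sketch in Python. It is not the pipeline's implementation: the pipeline uses sourmash, which compares compact sketches rather than full k-mer sets, whereas this example computes an exact Jaccard similarity on all k-mers. The function names, the value of k, and the sequences are all invented for illustration.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 4) -> set[str]:
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared k-mers divided by all distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reference(sample: str, references: dict[str, str], k: int = 4) -> str:
    """Return the name of the reference most similar to the sample."""
    sample_kmers = kmers(sample, k)
    return max(
        references,
        key=lambda name: jaccard(sample_kmers, kmers(references[name], k)),
    )


# Toy sequences, invented for illustration only.
references = {
    "ref_A": "ACGTACGTTAGCCGATAGCTAGCTAACG",
    "ref_B": "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
sample = "ACGTACGTTAGCCGATAGCTAGGTAACG"  # differs from ref_A by one base

print(pick_reference(sample, references))
```

Real genomes are far too large for exact k-mer set comparison to be practical at scale, which is why sketching tools subsample k-mers before comparing them.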

Pathogensurveillance uses several strategies to address the complexity of working with sequences that could be anything from plants to bacteria to fungi. First, pathogensurveillance uses reasonable baseline parameters that work well with most species. For some processes, however, a one-size-fits-all strategy isn’t feasible. In such cases, pathogensurveillance changes the analysis workflow automatically (e.g. identifying core genes differently for prokaryotes and eukaryotes). These analysis branchpoints are facilitated by the Nextflow framework and require no additional input from the user.

While pathogensurveillance may be a useful tool for researchers of all levels, it was designed with clinicians in mind who may have limited bioinformatics training. Pathogensurveillance is very simple to run. At a minimum, all that needs to be supplied is a CSV file with a single column specifying the path to your sample’s sequencing reads. For more experienced users, there are opportunities to customize the parameters of the pipeline. In particular, the final report is designed to be configurable through Quarto and the psminer plugin system.
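
As an illustration of the minimal input described above, the following Python sketch builds a single-column CSV from a list of read file paths. The column header `path`, the function name, and the file paths are assumptions made for this example; consult the pipeline's input documentation for the exact header it expects.

```python
import csv
import io


def sample_sheet_text(read_paths, column="path"):
    """Build the text of a one-column CSV listing sequencing read files.

    The default header "path" is an assumption for illustration only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])           # header row
    for read_path in read_paths:
        writer.writerow([str(read_path)])  # one read file per row
    return buffer.getvalue()


# Hypothetical read files; replace with the paths to your own data.
reads = ["reads/sample_1.fastq.gz", "reads/sample_2.fastq.gz"]
print(sample_sheet_text(reads), end="")
```

The returned text can be written to a file and passed to the pipeline as its sample sheet.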

Pathogensurveillance is particularly good for:

  • unknown sample identification
  • exploratory population analysis using minimal input parameters
  • inexperienced bioinformatics users
  • efficient parallelization of tasks
  • repeated analysis where you would like to compare a few new samples to a past run

Pathogensurveillance is not designed for:

  • viral sequences
  • non-gDNA datasets (RNA-seq, RAD-seq, ChIP-seq, etc.)
  • mixed/impure samples
  • researchers who want to use a highly configurable pipeline, or those who would like to extensively test the pipeline at each stage

Note

Pathogensurveillance manages its many software dependencies through containers. The pipeline works on Linux machines and HPC clusters, but Singularity and Docker have compatibility issues on macOS and Windows computers.

Pipeline summary

This is a quick breakdown of the processes used by pathogensurveillance:

  • Download sequences and references if they are not provided locally
  • Quickly obtain several initial sample references (bbmap)
  • More accurately select appropriate reference genomes. Genome “sketches” are compared between first-pass references, samples, and any references directly provided by the user (sourmash)
  • Genome assembly
    • Illumina short reads (spades)
    • PacBio or Oxford Nanopore long reads (flye)
  • Genome annotation (bakta)
  • Align reads to reference sequences (bwa)
  • Variant calling and filtering (graphtyper, vcflib)
  • Determine relationship between samples and references
    • Build SNP tree from variant calls (iqtree)
    • For prokaryotes:
      • Identify shared orthologs (pirate)
      • Build tree from core genome phylogeny (iqtree)
    • For eukaryotes:
      • Identify BUSCO genes (busco)
      • Build tree from BUSCO genes (read2tree)
  • Generate interactive HTML report/PDF file
    • Sequence and assembly information (fastqc, multiqc, quast)
    • Sample identification tables and heatmaps
    • Phylogenetic trees from genome-wide SNPs and core genes
    • Minimum spanning network
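
The branch points in the list above (assembler by read type, core gene tools by organism type) can be summarized as a small lookup. This is only an illustrative Python sketch: the pipeline makes these choices internally within Nextflow, and the function name, argument names, and dictionary keys here are hypothetical. Only the tool names are taken from the summary.

```python
from __future__ import annotations

# Tool names come from the pipeline summary; the keys are hypothetical labels.
ASSEMBLERS = {"short": "spades", "long": "flye"}
CORE_GENE_TOOLS = {
    "prokaryote": {"gene_finder": "pirate", "tree_builder": "iqtree"},
    "eukaryote": {"gene_finder": "busco", "tree_builder": "read2tree"},
}


def plan_tools(read_type: str, domain: str) -> dict[str, str]:
    """Return the tools the summary lists for a given read type and domain."""
    if read_type not in ASSEMBLERS:
        raise ValueError(f"unknown read type: {read_type!r}")
    if domain not in CORE_GENE_TOOLS:
        raise ValueError(f"unknown domain: {domain!r}")
    return {"assembler": ASSEMBLERS[read_type], **CORE_GENE_TOOLS[domain]}


print(plan_tools("long", "eukaryote"))
```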

Flowchart