Pathogen Surveillance Report

Summary

This report is produced by the nf-core/pathogensurveillance pipeline.

Report group: xan_test
Sample count: 29
Last updated: May 31, 2024
Pipeline version: 0.1

Pipeline Status Report

This section provides an overview of the pipeline execution status, including a summary table, detailed sample-specific issues, and group-level issues for further analysis.

About this table

This table provides a high-level overview of pipeline steps where issues were detected. It tells you how many groups or samples have issues that require your attention.

About this table

This table dives into the issues that impact entire report groups, providing a more granular view of problems not limited to individual samples but affecting the group as a whole. It helps in identifying systemic issues or errors in group-specific processes.

About this table

This table offers detailed insights into issues specific to individual samples. It is designed to help you pinpoint and address sample-specific problems, facilitating targeted troubleshooting and resolution efforts.

Input data

Identification

Initial identification

The following data provides tentative classifications of the samples based on exact matches of a subset of short DNA sequences. These are intended to be preliminary identifications. For more robust identifications based on whole genome sequences, see the results of the core genome phylogeny below.

Taxonomic classification summary
Per-sample classification

Initial classification of 29 samples identified all of them as:

Bacteria > Proteobacteria > Gammaproteobacteria > Xanthomonadales > Xanthomonadaceae > Xanthomonas > Xanthomonas hortorum

About this table

This table shows the “highest scoring” tentative taxonomic classification for each sample. The included metrics indicate how each sample compares with reference genomes in online databases and how likely these comparisons are to be valid.

Sample: The sample ID submitted by the user.
WKID: Weighted k-mer Identity, adjusted for genome size differences.
ANI: An estimate of average nucleotide identity (ANI), derived from WKID and k-mer length.
Completeness: The percentage of the reference genome represented in the query.
Top Hit: The name of the reference genome most similar to each sample based on the scoring criteria used.
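
As a rough illustration of how an identity estimate can be derived from a k-mer identity and the k-mer length: if each base matches independently with probability ANI, a whole k-mer matches with probability ANI^k, so ANI is approximately WKID^(1/k). This is a common approximation and an assumption here; the exact conversion used by sendsketch may differ. The function name and the default k of 31 are hypothetical.

```python
def ani_from_kmer_identity(wkid: float, k: int = 31) -> float:
    """Approximate ANI from a k-mer identity.

    Assumes mismatches are independent per base, so a k-mer matches with
    probability ANI**k. This is an illustrative model, not necessarily the
    formula used by any specific tool.
    """
    if not 0.0 <= wkid <= 1.0:
        raise ValueError("wkid must be a fraction between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return wkid ** (1.0 / k)
```

Under this model, a k-mer identity of 0.5 with k = 31 corresponds to an ANI of about 0.978, which shows why even closely related genomes can share only a modest fraction of exact k-mers.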

Most similar organisms

About this table

This table shows the Average Nucleotide Identity (ANI) between each sample and the 2 references most similar to it based on this measure. ANI measures how similar the shared portion of two genomes is. Because it considers only the shared portion, differences such as extra plasmids or chromosomal duplications are not reflected in the value.
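
The point that ANI ignores unshared sequence can be sketched with a toy calculation. The example below assumes a fragment-based view in which each genome fragment either aligns with some identity or does not align at all; the function name and input layout are hypothetical, and real ANI tools are considerably more involved.

```python
from typing import Optional, Sequence, Tuple


def ani_and_aligned_fraction(
    fragment_identities: Sequence[Optional[float]],
) -> Tuple[float, float]:
    """Return (ANI, aligned fraction) for a list of per-fragment identities.

    Each entry is the identity of one aligned fragment (0 to 1), or None
    if the fragment has no counterpart in the other genome.
    """
    aligned = [x for x in fragment_identities if x is not None]
    if not aligned:
        raise ValueError("no fragments aligned")
    # ANI averages only over fragments that aligned.
    ani = sum(aligned) / len(aligned)
    # The aligned fraction captures what ANI leaves out.
    aligned_fraction = len(aligned) / len(fragment_identities)
    return ani, aligned_fraction
```

Adding unaligned fragments (for example, an extra plasmid) lowers the aligned fraction but leaves the ANI unchanged, which is why the two values are best read together.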

About this plot

This plot shows the results of comparing the similarity of all samples and references to each other. These similarity metrics are based on the presence and abundance of short exact sequence matches between samples (i.e. comparisons of k-mer sketches). These measurements are not as reliable as the methods used to create phylogenetic trees, but may be useful if phylogenetic trees could not be inferred for these samples.
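
A minimal sketch of how a k-mer sketch comparison works, assuming a bottom-k MinHash scheme: each sequence is reduced to the smallest hash values of its k-mers, and similarity is estimated from how many of the smallest hashes in the combined set are shared. This is a simplification for illustration only. It ignores k-mer abundance and reverse complements, and the function names, hash choice, and default sizes are assumptions rather than the settings used by the pipeline.

```python
import hashlib


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq: str, k: int = 21, size: int = 500) -> set:
    """Bottom-`size` MinHash sketch: the smallest hashes of the k-mers."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
        )
        for kmer in kmers(seq, k)
    }
    return set(sorted(hashes)[:size])


def jaccard_estimate(sketch_a: set, sketch_b: set, size: int = 500) -> float:
    """Estimate Jaccard similarity from two bottom-k sketches."""
    # Take the smallest hashes of the union, then count those in both.
    merged = sorted(sketch_a | sketch_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)
```

Because only a fixed number of hashes is kept per genome, comparisons stay fast regardless of genome size, at the cost of the reduced reliability noted above.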

About this table

This table shows the Percentage Of Conserved Proteins (POCP) between each sample and the 2 references most similar to it based on this measure. POCP is used to measure the proportion of proteins shared between two genomes. Which proteins are shared is determined from pairwise comparisons of all proteins between all genomes. The POCP between two genomes is the sum of the number of shared proteins in each genome divided by the sum of the number of total proteins in each genome (Qin et al. 2014). Currently, POCP is only calculated for Prokaryotes.
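
The POCP formula described above can be written as a short function. The argument names are hypothetical; how shared proteins are identified (the pairwise protein comparisons) is outside the scope of this sketch.

```python
def pocp(shared_in_a: int, shared_in_b: int,
         total_in_a: int, total_in_b: int) -> float:
    """Percentage Of Conserved Proteins between genomes A and B.

    shared_in_a: proteins of A with a match in B
    shared_in_b: proteins of B with a match in A
    total_in_a, total_in_b: total proteins in each genome
    """
    if total_in_a + total_in_b == 0:
        raise ValueError("at least one genome must contain proteins")
    if shared_in_a > total_in_a or shared_in_b > total_in_b:
        raise ValueError("shared counts cannot exceed totals")
    return 100.0 * (shared_in_a + shared_in_b) / (total_in_a + total_in_b)
```

For instance, with made-up counts of 3000 of 4000 proteins shared in one genome and 3100 of 4200 in the other, the POCP is about 74.4.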

About this plot

This plot shows the results of comparing the protein content of all samples and references to each other. POCP is used to measure the proportion of proteins shared between two genomes. Which proteins are shared is determined from pairwise comparisons of all proteins between all genomes. The POCP between two genomes is the sum of the number of shared proteins in each genome divided by the sum of the number of total proteins in each genome (Qin et al. 2014). Currently, POCP is only calculated for Prokaryotes.

Phylogenetic context

This section includes phylogenetic trees of samples with reference sequences downloaded from RefSeq, meant to provide a reliable identification using genome-scale data. The accuracy of this identification depends on the presence of close reference sequences in RefSeq and the accuracy of the initial identification.

Core gene phylogeny

About this plot

This is a core gene phylogeny of samples with RefSeq genomes for context. A core gene phylogeny uses the sequences of all genes shared by all of the genomes included in the tree to infer evolutionary relationships. It is the most robust identification provided by this pipeline, but its precision is still limited by the availability of similar reference sequences.
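
The idea of a core gene set can be sketched as a set intersection: a gene family is "core" only if it is present in every genome in the tree. The input layout and names below are hypothetical and do not reflect the output format of any specific pangenome tool.

```python
def core_genes(gene_presence: dict) -> set:
    """Return gene families present in every genome.

    gene_presence maps a genome name to the set of gene family IDs
    found in that genome.
    """
    gene_sets = [set(genes) for genes in gene_presence.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)
```

One consequence of this definition is that adding more divergent genomes can only keep the core set the same or shrink it, never enlarge it.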

Genetic diversity

SNP trees

22_331_assembly

29 samples with 1269 variants aligned to reference “22_331_assembly”:

About this plot

This is a representation of a Single Nucleotide Polymorphism (SNP) tree, depicting the genetic relationships among samples in comparison to a reference assembly.

The tree is less robust than a core gene phylogeny and cannot offer insights on evolutionary relationships among strains, but it does offer one way to visualize the genetic diversity among samples, with genetically similar strains clustering together.

Question: does it make sense to show the reference within the tree?

29 samples aligned to “22_331_assembly”:

Threshold:

About this plot

This figure depicts a minimum spanning network (MSN). The nodes represent unique multilocus genotypes, and the size of each node is proportional to the number of samples that share the same genotype.

The edges represent the SNP differences between two given genotypes, and the darker the color of the edges, the fewer SNP differences between the two.

Note: within these MSNs, edge lengths are not proportional to SNP differences.
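
The construction described above can be sketched as follows: identical genotypes are collapsed into one node, pairwise SNP differences are counted, and the shortest edges that connect all nodes are kept. This sketch builds a single minimum spanning tree with Kruskal's algorithm. A minimum spanning network may additionally keep equally short alternative edges, so this is a simplification. The input layout (one string of SNP alleles per sample) and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def snp_differences(a: str, b: str) -> int:
    """Count positions at which two equal-length genotypes differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(sample_genotypes: dict):
    """Return (node_sizes, edges) for a genotype network.

    node_sizes: {genotype: number of samples sharing it}
    edges: list of (genotype_a, genotype_b, snp_differences)
    """
    node_sizes = Counter(sample_genotypes.values())
    genotypes = sorted(node_sizes)
    candidates = sorted(
        (snp_differences(a, b), a, b) for a, b in combinations(genotypes, 2)
    )

    # Union-find structure to detect whether an edge would close a cycle.
    parent = {g: g for g in genotypes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    edges = []
    for distance, a, b in candidates:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.append((a, b, distance))
    return dict(node_sizes), edges
```

In a plot, the returned node sizes would set node area and the SNP differences would set edge color, while edge length remains a layout choice with no meaning.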

References

Methods

The pathogen surveillance pipeline used the following tools that should be referenced as appropriate:

A sample is first identified to genus using sendsketch and further identified to species using sourmash (Brown and Irber 2016).
The Nextflow data-driven computational pipeline enables deployment of complex parallel and reactive workflows (Di Tommaso et al. 2017).

Input settings

Add settings used to run Nextflow and the pipeline parameters.

Analysis software

module	program	version	citation
ALIGN_FEATURE_SEQUENCES	mafft	7.520	Katoh et al. (2002)
BAKTA_BAKTA	bakta	1.9.2	Schwengers et al. (2021)
BBMAP_SENDSKETCH	bbmap	39.01	Bushnell (2014)
BGZIP_MAKE_GZIP	tabix	1.12	Li (2011)
BWA_INDEX	bwa	0.7.17-r1188	Li and Durbin (2009)
BWA_MEM	bwa	0.7.17-r1188	Li and Durbin (2009)
BWA_MEM	samtools	1.18	Danecek et al. (2021)
CUSTOM_DUMPSOFTWAREVERSIONS	python	3.12.0
CUSTOM_DUMPSOFTWAREVERSIONS	yaml	6.0.1
DOWNLOAD_ASSEMBLIES	datasets	16.0.0	Sayers et al. (2022)
FASTP	fastp	0.23.4	Chen (2023)
FASTQC	fastqc	0.12.1	Andrews et al. (2010)
FILTER_ASSEMBLY	python	3.9.1
FIND_ASSEMBLIES	xtract	16.2
GATK4_VARIANTFILTRATION	gatk4	4.3.0.0	Van der Auwera and O’Connor (2020)
GRAPHTYPER_GENOTYPE	graphtyper	2.7.2	Eggertsson et al. (2017)
GRAPHTYPER_VCFCONCATENATE	graphtyper	2.7.2	Eggertsson et al. (2017)
INITIAL_CLASSIFICATION	r-base	4.2.1	R Core Team (2021)
IQTREE2_CORE	iqtree	2.1.4-beta	Nguyen et al. (2015)
IQTREE2_SNP	iqtree	2.1.4-beta	Nguyen et al. (2015)
KHMER_TRIMLOWABUND	khmer	3.0.0a3	Crusoe et al. (2015)
MAFFT_SMALL	mafft	7.520	Katoh et al. (2002)
PICARD_ADDORREPLACEREADGROUPS	picard	3.1.1	“Picard Toolkit” (2019)
PICARD_CREATESEQUENCEDICTIONARY	picard	3.1.1	“Picard Toolkit” (2019)
PICARD_MARKDUPLICATES	picard	3.1.1	“Picard Toolkit” (2019)
PICARD_SORTSAM_1	picard	3.1.1	“Picard Toolkit” (2019)
PIRATE	pirate	1.0.5	Bayliss et al. (2019)
QUAST	quast	5.2.0	Mikheenko et al. (2018)
REFORMAT_PIRATE_RESULTS	pirate	1.0.5	Bayliss et al. (2019)
SAMPLESHEET_CHECK	r-base	4.2.1	R Core Team (2021)
SAMTOOLS_FAIDX	samtools	1.18	Danecek et al. (2021)
SAMTOOLS_INDEX	samtools	1.18	Danecek et al. (2021)
SOURMASH_COMPARE	sourmash	4.6.1	Brown and Irber (2016)
SOURMASH_SKETCH_GENOME	sourmash	4.8.4	Brown and Irber (2016)
SOURMASH_SKETCH_READS	sourmash	4.8.4	Brown and Irber (2016)
SPADES	spades	3.15.5	Prjibelski et al. (2020)
SUBSET_READS	seqkit	2.2.0	Shen et al. (2016)
TABIX_TABIX	tabix	1.12	Li (2011)
VCFLIB_VCFFILTER	vcflib	1.0.3	Garrison et al. (2022)
VCF_TO_SNPALN	perl	5.32.1 built for x86_64-linux-thread-multi
VCF_TO_TAB	vcftools	0.1.16	Danecek et al. (2011)
Workflow	Nextflow	23.10.1	Di Tommaso et al. (2017)
Workflow	nf-core/plantpathsurveil	1.0dev

version and packages

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psminer_0.1.0        leaflet_2.2.2        rcrossref_1.2.0     
 [4] ggdendro_0.2.0       metacoder_0.3.7      webshot2_0.1.1      
 [7] kableExtra_1.4.0     ggnewscale_0.4.10    phangorn_2.11.1     
[10] visNetwork_2.1.2     igraph_2.0.3         ggtree_3.11.1       
[13] poppr_2.9.6          adegenet_2.1.10      ade4_1.7-22         
[16] palmerpenguins_0.1.0 lubridate_1.9.3      forcats_1.0.0       
[19] stringr_1.5.1        tidyr_1.3.1          tibble_3.2.1        
[22] tidyverse_2.0.0      heatmaply_1.5.0      viridis_0.6.5       
[25] viridisLite_0.4.2    plotly_4.10.4        pheatmap_1.0.12     
[28] magrittr_2.0.3       ape_5.7-1            phylocanvas_0.1.3   
[31] yaml_2.3.8           purrr_1.0.2          knitr_1.46          
[34] readr_2.1.5          ggplot2_3.5.0        dplyr_1.1.4         

loaded via a namespace (and not attached):
  [1] uuid_1.2-0         fastmatch_1.1-4    systemfonts_1.0.4 
  [4] plyr_1.8.9         lazyeval_0.2.2     splines_4.1.2     
  [7] websocket_1.4.1    crosstalk_1.2.1    usethis_2.1.5     
 [10] rncl_0.8.7         digest_0.6.35      foreach_1.5.2     
 [13] yulab.utils_0.1.4  ca_0.71.1          htmltools_0.5.8.1 
 [16] fansi_1.0.6        memoise_2.0.1      cluster_2.1.2     
 [19] remotes_2.5.0      tzdb_0.4.0         vroom_1.6.5       
 [22] svglite_2.1.3      timechange_0.3.0   prettyunits_1.2.0 
 [25] colorspace_2.1-0   xfun_0.43          crayon_1.5.2      
 [28] jsonlite_1.8.8     phylobase_0.8.12   iterators_1.0.14  
 [31] glue_1.7.0         registry_0.5-1     gtable_0.3.4      
 [34] webshot_0.5.5      seqinr_4.2-36      polysat_1.7-7     
 [37] pkgbuild_1.4.4     scales_1.3.0       miniUI_0.1.1.1    
 [40] Rcpp_1.0.12        xtable_1.8-4       progress_1.2.3    
 [43] gridGraphics_0.5-1 tidytree_0.4.6     bit_4.0.5         
 [46] DT_0.32            htmlwidgets_1.6.4  httr_1.4.7        
 [49] RColorBrewer_1.1-3 ellipsis_0.3.2     farver_2.1.1      
 [52] pkgconfig_2.0.3    XML_3.99-0.16.1    sass_0.4.9        
 [55] chromote_0.2.0     utf8_1.2.4         crul_1.4.0        
 [58] labeling_0.4.3     ggplotify_0.1.2    tidyselect_1.2.1  
 [61] rlang_1.1.3        reshape2_1.4.4     later_1.3.2       
 [64] munsell_0.5.1      tools_4.1.2        cachem_1.0.8      
 [67] cli_3.6.2          generics_0.1.3     devtools_2.4.3    
 [70] evaluate_0.23      fastmap_1.1.1      bit64_4.0.5       
 [73] processx_3.8.4     fs_1.6.3           dendextend_1.17.1 
 [76] nlme_3.1-155       mime_0.12          aplot_0.2.2       
 [79] xml2_1.3.6         compiler_4.1.2     rstudioapi_0.16.0 
 [82] curl_5.2.1         treeio_1.18.1      bslib_0.7.0       
 [85] RNeXML_2.4.11      stringi_1.8.3      ps_1.7.6          
 [88] desc_1.4.3         lattice_0.20-45    Matrix_1.4-0      
 [91] vegan_2.6-4        permute_0.9-7      vctrs_0.6.5       
 [94] pillar_1.9.0       lifecycle_1.0.4    jquerylib_0.1.4   
 [97] data.table_1.15.4  seriation_1.5.4    httpuv_1.6.15     
[100] patchwork_1.2.0    R6_2.5.1           promises_1.2.1    
[103] TSP_1.2-4          gridExtra_2.3      sessioninfo_1.2.2 
[106] codetools_0.2-18   pkgload_1.3.4      boot_1.3-28       
[109] MASS_7.3-55        assertthat_0.2.1   rprojroot_2.0.4   
[112] httpcode_0.3.0     withr_3.0.0        pegas_1.3         
[115] mgcv_1.8-39        parallel_4.1.2     hms_1.1.3         
[118] quadprog_1.5-8     grid_4.1.2         ggfun_0.1.4       
[121] rmarkdown_2.26     base64enc_0.1-3    shiny_1.8.1

References

Andrews, Simon et al. 2010. “FastQC: A Quality Control Tool for High Throughput Sequence Data.” Cambridge, United Kingdom.

Bayliss, Sion C, Harry A Thorpe, Nicola M Coyle, Samuel K Sheppard, and Edward J Feil. 2019. “PIRATE: A Fast and Scalable Pangenomics Toolbox for Clustering Diverged Orthologues in Bacteria.” Gigascience 8 (10): giz119.

Brown, C Titus, and Luiz Irber. 2016. “Sourmash: A Library for MinHash Sketching of DNA.” Journal of Open Source Software 1 (5): 27.

Bushnell, Brian. 2014. “BBMap: A Fast, Accurate, Splice-Aware Aligner.”

Chen, Shifu. 2023. “Ultrafast One-Pass FASTQ Data Preprocessing, Quality Control, and Deduplication Using Fastp.” Imeta 2 (2): e107.

Crusoe, Michael R, Hussien F Alameldin, Sherine Awad, Elmar Boucher, Adam Caldwell, Reed Cartwright, Amanda Charbonneau, et al. 2015. “The Khmer Software Package: Enabling Efficient Nucleotide Sequence Analysis.” F1000Research 4.

Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A Albers, Eric Banks, Mark A DePristo, Robert E Handsaker, et al. 2011. “The Variant Call Format and VCFtools.” Bioinformatics 27 (15): 2156–58.

Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021. “Twelve Years of SAMtools and BCFtools.” Gigascience 10 (2): giab008.

Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. “Nextflow Enables Reproducible Computational Workflows.” Nature Biotechnology 35 (4): 316–19.

Anaconda Software Distribution. 2016. “Computer Software.” Vers. 4: 2–2.

Eggertsson, Hannes P, Hakon Jonsson, Snaedis Kristmundsdottir, Eirikur Hjartarson, Birte Kehr, Gisli Masson, Florian Zink, et al. 2017. “Graphtyper Enables Population-Scale Genotyping Using Pangenome Graphs.” Nature Genetics 49 (11): 1654–60.

Garrison, Erik, Zev N Kronenberg, Eric T Dawson, Brent S Pedersen, and Pjotr Prins. 2022. “A Spectrum of Free Software Tools for Processing the VCF Variant Call Format: Vcflib, Bio-Vcf, Cyvcf2, Hts-Nim and Slivar.” PLoS Computational Biology 18 (5): e1009123.

Katoh, Kazutaka, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. 2002. “MAFFT: A Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform.” Nucleic Acids Research 30 (14): 3059–66.

Kurtzer, Gregory M, Vanessa Sochat, and Michael W Bauer. 2017. “Singularity: Scientific Containers for Mobility of Compute.” PloS One 12 (5): e0177459.

Li, Heng. 2011. “Tabix: Fast Retrieval of Sequence Features from Generic TAB-Delimited Files.” Bioinformatics 27 (5): 718–19.

Li, Heng, and Richard Durbin. 2009. “Fast and Accurate Short Read Alignment with Burrows–Wheeler Transform.” Bioinformatics 25 (14): 1754–60.

Mikheenko, Alla, Andrey Prjibelski, Vladislav Saveliev, Dmitry Antipov, and Alexey Gurevich. 2018. “Versatile Genome Assembly Evaluation with QUAST-LG.” Bioinformatics 34 (13): i142–50.

Nguyen, Lam-Tung, Heiko A Schmidt, Arndt Von Haeseler, and Bui Quang Minh. 2015. “IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies.” Molecular Biology and Evolution 32 (1): 268–74.

“Picard Toolkit.” 2019. Broad Institute, GitHub Repository. https://broadinstitute.github.io/picard/.

Prjibelski, Andrey, Dmitry Antipov, Dmitry Meleshko, Alla Lapidus, and Anton Korobeynikov. 2020. “Using SPAdes de Novo Assembler.” Current Protocols in Bioinformatics 70 (1): e102.

Qin, Qi-Long, Bin-Bin Xie, Xi-Ying Zhang, Xiu-Lan Chen, Bai-Cheng Zhou, Jizhong Zhou, Aharon Oren, and Yu-Zhong Zhang. 2014. “A Proposed Genus Boundary for the Prokaryotes Based on Genomic Insights.” Journal of Bacteriology 196 (12): 2210–15.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sayers, Eric W, Evan E Bolton, J Rodney Brister, Kathi Canese, Jessica Chan, Donald C Comeau, Ryan Connor, et al. 2022. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 50 (D1): D20.

Schwengers, Oliver, Lukas Jelonek, Marius Alfred Dieckmann, Sebastian Beyvers, Jochen Blom, and Alexander Goesmann. 2021. “Bakta: Rapid and Standardized Annotation of Bacterial Genomes via Alignment-Free Sequence Identification.” Microbial Genomics 7 (11): 000685.

Shen, Wei, Shuai Le, Yan Li, and Fuquan Hu. 2016. “SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/q File Manipulation.” PloS One 11 (10): e0163962.

Van der Auwera, Geraldine A, and Brian D O’Connor. 2020. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. O’Reilly Media.

About

The nf-core/pathogensurveillance pipeline was developed by: Zach Foster, Martha Sudermann, Camilo Parada-Rojas, Fernanda Iruegas-Bocardo, Ricardo Alcalá-Briseño, Jeff Chang and Nik Grunwald.

Other contributors include: Alex Weisberg, …

Feedback

To contribute, provide feedback, or report bugs, please visit our GitHub repository.

Please cite this pipeline and nf-core in publications as follows:

Foster et al. 2024. PathogenSurveillance: A nf-core pipeline for rapid analysis of pathogen genome data. In preparation.

Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. Nextflow Enables Reproducible Computational Workflows. Nature Biotechnology 35 (4): 316–19. https://doi.org/10.1038/nbt.3820.

Icons for this report were sampled from Bootstrap Icons, Freepik, Academicons, and Font Awesome.