
ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses

Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species pos...


Bibliographic Details
Main Authors: Pavlovikj, Natasha, Gomes-Neto, Joao Carlos, Deogun, Jitender S., Benson, Andrew K.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8142932/
https://www.ncbi.nlm.nih.gov/pubmed/34055480
http://dx.doi.org/10.7717/peerj.11376
author Pavlovikj, Natasha
Gomes-Neto, Joao Carlos
Deogun, Jitender S.
Benson, Andrew K.
collection PubMed
description Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species poses a tremendous opportunity for discovery and hypothesis-generating research into the ecology and evolution of these microorganisms. However, limitations in the flexibility, scalability, and user-friendliness of existing pipelines for population-scale inquiry restrict the application of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, reproducible, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: (1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; (2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; (3) Use of high-performance and high-throughput computational platforms; (4) Generation of hierarchical population structure analyses based on combinations of multi-locus and Bayesian statistical approaches to classification for ecological and epidemiological inquiries; (5) Association of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases with the hierarchically related genotypic classifications; and (6) Production of pan-genome annotations and data compilations that can be used for downstream analyses such as identification of population-specific genomic signatures. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to 26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS uniquely facilitates addition or removal of programs from the workflow or modification of options within them. To demonstrate the versatility of the ProkEvo platform, we performed hierarchical population structure analyses on available genomes of three distinct pathogenic bacterial species as individual case studies. The specific case studies illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be integrated into an analysis. Collectively, our study shows that ProkEvo presents a practical, viable option for scalable, automated analyses of bacterial populations with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.
format Online
Article
Text
id pubmed-8142932
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8142932 2021-05-28 PeerJ Bioinformatics PeerJ Inc. 2021-05-21 /pmc/articles/PMC8142932/ /pubmed/34055480 http://dx.doi.org/10.7717/peerj.11376 Text en © 2021 Pavlovikj et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8142932/
https://www.ncbi.nlm.nih.gov/pubmed/34055480
http://dx.doi.org/10.7717/peerj.11376