MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins

Here we present MARVEL, a tool for prediction of double-stranded DNA bacteriophage sequences in metagenomic bins. MARVEL uses a random forest machine learning approach. We trained the program on a dataset with 1,247 phage and 1,029 bacterial genomes, and tested it on a dataset with 335 bacterial and...

Bibliographic Details
Main Authors: Amgarten, Deyvid; Braga, Lucas P. P.; da Silva, Aline M.; Setubal, João C.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2018
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6090037/
https://www.ncbi.nlm.nih.gov/pubmed/30131825
http://dx.doi.org/10.3389/fgene.2018.00304
author Amgarten, Deyvid
Braga, Lucas P. P.
da Silva, Aline M.
Setubal, João C.
collection PubMed
description Here we present MARVEL, a tool for prediction of double-stranded DNA bacteriophage sequences in metagenomic bins. MARVEL uses a random forest machine learning approach. We trained the program on a dataset with 1,247 phage and 1,029 bacterial genomes, and tested it on a dataset with 335 bacterial and 177 phage genomes. We show that three simple genomic features extracted from contig sequences were sufficient to achieve good performance in separating bacterial from phage sequences: gene density, strand shifts, and fraction of significant hits to a viral protein database. We compared the performance of MARVEL to that of VirSorter and VirFinder, two popular programs for predicting viral sequences. Our results show that all three programs have comparable specificity, but MARVEL achieves much better performance on the recall (sensitivity) measure. This means that MARVEL should be able to identify many more phage sequences in metagenomic bins than has previously been possible. In a simple test with real data containing mostly bacterial sequences, MARVEL classified 58 out of 209 bins as phage genomes; other evidence suggests that 57 of these 58 bins are novel phage sequences. MARVEL is freely available at https://github.com/LaboratorioBioinformatica/MARVEL.
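The three features named in the description can be illustrated with a minimal Python sketch. It is not MARVEL's own code: the `Gene` record, the `contig_features` function, and the exact normalizations (genes per kilobase, strand changes per adjacent gene pair, hits per predicted gene) are assumptions made for illustration, and the gene calls and database hits are taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gene:
    """Hypothetical record for one predicted gene on a contig."""
    start: int        # start coordinate on the contig
    end: int          # end coordinate on the contig
    strand: str       # "+" or "-"
    viral_hit: bool   # True if the protein had a significant hit to a viral protein database


def contig_features(genes: List[Gene], contig_length: int) -> Dict[str, float]:
    """Compute three simple per-contig features (assumed definitions)."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")

    # Order genes along the contig so that neighbours are adjacent genes.
    ordered = sorted(genes, key=lambda g: g.start)
    n = len(ordered)

    # Gene density: predicted genes per kilobase of contig.
    gene_density = n / (contig_length / 1000)

    # Strand shifts: fraction of adjacent gene pairs on different strands.
    shifts = sum(1 for a, b in zip(ordered, ordered[1:]) if a.strand != b.strand)
    strand_shift_fraction = shifts / (n - 1) if n > 1 else 0.0

    # Fraction of genes with a significant hit to the viral protein database.
    viral_hit_fraction = sum(g.viral_hit for g in ordered) / n if n else 0.0

    return {
        "gene_density_per_kb": gene_density,
        "strand_shift_fraction": strand_shift_fraction,
        "viral_hit_fraction": viral_hit_fraction,
    }


if __name__ == "__main__":
    # Made-up toy contig, only to show the call.
    toy = [
        Gene(1, 900, "+", True),
        Gene(950, 1800, "+", True),
        Gene(1900, 2700, "-", False),
        Gene(2800, 3900, "+", True),
    ]
    print(contig_features(toy, contig_length=4000))
```

A feature vector of this kind, computed per contig or per bin, is the sort of input a random forest classifier could be trained on to separate phage from bacterial sequences.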
format Online
Article
Text
id pubmed-6090037
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6090037 2018-08-21 MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins. Amgarten, Deyvid; Braga, Lucas P. P.; da Silva, Aline M.; Setubal, João C. Front Genet. Genetics. Frontiers Media S.A. 2018-08-07 /pmc/articles/PMC6090037/ /pubmed/30131825 http://dx.doi.org/10.3389/fgene.2018.00304 Text en Copyright © 2018 Amgarten, Braga, da Silva and Setubal. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins
topic Genetics