VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

BACKGROUND: Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approache...

Bibliographic Details
Main Authors: Ren, Jie; Ahlgren, Nathan A.; Lu, Yang Young; Fuhrman, Jed A.; Sun, Fengzhu
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5501583/
https://www.ncbi.nlm.nih.gov/pubmed/28683828
http://dx.doi.org/10.1186/s40168-017-0283-5
author Ren, Jie
Ahlgren, Nathan A.
Lu, Yang Young
Fuhrman, Jed A.
Sun, Fengzhu
collection PubMed
description BACKGROUND: Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results especially for short contigs that have few predicted proteins or lack proteins with similarity to previously known viruses. METHODS: We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder’s performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014. RESULTS: VirFinder had significantly better rates of identifying true viral contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively), thus VirFinder works considerably better for small contigs than VirSorter. VirFinder furthermore identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder’s potential advantage in identifying novel viral sequences. 
Application of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients reveals higher viral diversity in healthy individuals than cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy patients and a putative Veillonella genus prophage associated with cirrhosis patients. CONCLUSIONS: This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s40168-017-0283-5) contains supplementary material, which is available to authorized users.
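The abstract describes classifying contigs by their k-mer signatures rather than by gene similarity. As a minimal sketch of what such a signature could look like, the Python below computes a normalized k-mer frequency vector for one sequence. The function names, the default k = 4, and the merging of each k-mer with its reverse complement are illustrative assumptions, not details taken from this record, and the classifier that would consume these features is not shown.

```python
from collections import Counter
from itertools import product

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Assumption for this sketch: the two strands are treated as equivalent,
    so the lexicographically smaller of the pair is used as the key.
    """
    return min(kmer, reverse_complement(kmer))


def kmer_frequencies(seq, k=4):
    """Hypothetical helper: map each canonical k-mer to its relative frequency.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. Every possible canonical k-mer appears in the result,
    with frequency 0.0 if it was never observed.
    """
    seq = seq.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {key: (counts[key] / total if total else 0.0) for key in keys}


if __name__ == "__main__":
    # Toy contig; a real input would be a contig of about 1 kb or longer.
    features = kmer_frequencies("ACGTTGCAACGTAGCT", k=2)
    for kmer, freq in features.items():
        print(kmer, round(freq, 3))
```

Vectors of this kind, computed for labeled viral and host sequences, are the sort of input a supervised classifier could be trained on; the choice of k and of model is left open here.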
format Online
Article
Text
id pubmed-5501583
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5501583 2017-07-10 VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data Ren, Jie Ahlgren, Nathan A. Lu, Yang Young Fuhrman, Jed A. Sun, Fengzhu Microbiome Methodology BioMed Central 2017-07-06 /pmc/articles/PMC5501583/ /pubmed/28683828 http://dx.doi.org/10.1186/s40168-017-0283-5 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5501583/
https://www.ncbi.nlm.nih.gov/pubmed/28683828
http://dx.doi.org/10.1186/s40168-017-0283-5