Elimination of Foreign Sequences in Eukaryotic Viral Reference Genomes Improves the Accuracy of Virome Analysis

Bibliographic Details
Main authors: Chen, Junjie; Sun, Yue; Yan, Xiaomin; Ren, Zilin; Wang, Guoshuai; Liu, Yuhang; Zhao, Zihan; Yi, Le; Tu, Changchun; He, Biao
Format: Online Article Text
Language: English
Published: American Society for Microbiology 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9765019/
https://www.ncbi.nlm.nih.gov/pubmed/36286492
http://dx.doi.org/10.1128/msystems.00907-22
author Chen, Junjie
Sun, Yue
Yan, Xiaomin
Ren, Zilin
Wang, Guoshuai
Liu, Yuhang
Zhao, Zihan
Yi, Le
Tu, Changchun
He, Biao
author_sort Chen, Junjie
collection PubMed
description Widespread in public databases, foreign contaminant sequences pose a substantial obstacle in genomic analyses. Such contamination in viral genome databases is also notorious but more complicated and often causes questionable results in various applications, particularly in virome-based virus detection. Here, we conducted comprehensive screening and identification of the foreign sequences hidden in the largest eukaryotic viral genome collections of GenBank and UniProt using a scrutiny pipeline, which enables us to rigorously detect those problematic viral sequences (PVSs) with origins in hosts, vectors, and laboratory components. As a result, a total of 766 nucleotide PVSs and 276 amino acid PVSs with lengths up to 6,605 bp were determined, which were widely distributed in 39 families with many involving highly public health-concerning viruses, such as hepatitis C virus, Crimean-Congo hemorrhagic fever virus, and filovirus. The majority of these PVSs are genomic fragments of hosts including humans and bacteria. However, they cannot simply be regarded as foreign contaminants, since parts of them are results of natural occurrence or artificial engineering of viruses. Nevertheless, they severely disturb such sequence-based analyses as genome annotation, taxonomic assignment, and virome profiling. Therefore, we provide a clean version of the eukaryotic viral reference data set by the removal of these PVSs, which allows more accurate virome analysis with less time consumed than with other comprehensive databases.
IMPORTANCE High-throughput sequencing-based viromics highly depends on reference databases, but foreign contamination is widespread in public databases and often leads to confusing and even wrong conclusions in genomic analysis and viromic profiling. To address this issue, we systematically detected and identified the contamination in the largest viral sequence collections of GenBank and UniProt based on a stringent scrutiny pipeline. We found hundreds of PVSs that are related to hosts, vectors, and laboratory components. By the removal of them, the resulting data set greatly improves the accuracy and efficiency of eukaryotic virome profiling. These results refresh our knowledge of the type and origin of PVSs and also have warning implications for viromic analysis. Viromic practitioners should be aware of these problems caused by PVSs and need to realize that a careful review of bioinformatic results is necessary for a reliable conclusion.
format Online
Article
Text
id pubmed-9765019
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher American Society for Microbiology
record_format MEDLINE/PubMed
spelling pubmed-9765019 2022-12-21 Elimination of Foreign Sequences in Eukaryotic Viral Reference Genomes Improves the Accuracy of Virome Analysis Chen, Junjie Sun, Yue Yan, Xiaomin Ren, Zilin Wang, Guoshuai Liu, Yuhang Zhao, Zihan Yi, Le Tu, Changchun He, Biao mSystems Research Article American Society for Microbiology 2022-10-26 /pmc/articles/PMC9765019/ /pubmed/36286492 http://dx.doi.org/10.1128/msystems.00907-22 Text en Copyright © 2022 Chen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/).
title Elimination of Foreign Sequences in Eukaryotic Viral Reference Genomes Improves the Accuracy of Virome Analysis
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9765019/
https://www.ncbi.nlm.nih.gov/pubmed/36286492
http://dx.doi.org/10.1128/msystems.00907-22