A multi-task CNN learning model for taxonomic assignment of human viruses
BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the refe...
Main authors: | Ma, Haoran; Tan, Tin Wee; Ban, Kenneth Hon Kim |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2021 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8170063/ https://www.ncbi.nlm.nih.gov/pubmed/34078269 http://dx.doi.org/10.1186/s12859-021-04084-w |
_version_ | 1783702157484097536 |
---|---|
author | Ma, Haoran Tan, Tin Wee Ban, Kenneth Hon Kim |
author_facet | Ma, Haoran Tan, Tin Wee Ban, Kenneth Hon Kim |
author_sort | Ma, Haoran |
collection | PubMed |
description | BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of overall likelihood of a correct taxonomic assignment for a sample. RESULTS: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2 and the region assignments were used for estimation of genomic coverage that was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data. CONCLUSIONS: We have developed a pipeline that combines a novel MT-CNN model that is able to identify viruses with divergent sequences together with assignment of the genomic region, with a Bayesian approach to ranking of taxonomic assignments by taking into account both the number of assigned reads and genomic coverage. The pipeline is available at GitHub via https://github.com/MaHaoran627/CNN_Virus. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04084-w. |
format | Online Article Text |
id | pubmed-8170063 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-81700632021-06-02 A multi-task CNN learning model for taxonomic assignment of human viruses Ma, Haoran Tan, Tin Wee Ban, Kenneth Hon Kim BMC Bioinformatics Software BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of overall likelihood of a correct taxonomic assignment for a sample. RESULTS: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2 and the region assignments were used for estimation of genomic coverage that was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data. CONCLUSIONS: We have developed a pipeline that combines a novel MT-CNN model that is able to identify viruses with divergent sequences together with assignment of the genomic region, with a Bayesian approach to ranking of taxonomic assignments by taking into account both the number of assigned reads and genomic coverage. The pipeline is available at GitHub via https://github.com/MaHaoran627/CNN_Virus. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04084-w. BioMed Central 2021-06-02 /pmc/articles/PMC8170063/ /pubmed/34078269 http://dx.doi.org/10.1186/s12859-021-04084-w Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Ma, Haoran Tan, Tin Wee Ban, Kenneth Hon Kim A multi-task CNN learning model for taxonomic assignment of human viruses |
title | A multi-task CNN learning model for taxonomic assignment of human viruses |
title_full | A multi-task CNN learning model for taxonomic assignment of human viruses |
title_fullStr | A multi-task CNN learning model for taxonomic assignment of human viruses |
title_full_unstemmed | A multi-task CNN learning model for taxonomic assignment of human viruses |
title_short | A multi-task CNN learning model for taxonomic assignment of human viruses |
title_sort | multi-task cnn learning model for taxonomic assignment of human viruses |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8170063/ https://www.ncbi.nlm.nih.gov/pubmed/34078269 http://dx.doi.org/10.1186/s12859-021-04084-w |
work_keys_str_mv | AT mahaoran amultitaskcnnlearningmodelfortaxonomicassignmentofhumanviruses AT tantinwee amultitaskcnnlearningmodelfortaxonomicassignmentofhumanviruses AT bankennethhonkim amultitaskcnnlearningmodelfortaxonomicassignmentofhumanviruses AT mahaoran multitaskcnnlearningmodelfortaxonomicassignmentofhumanviruses AT tantinwee multitaskcnnlearningmodelfortaxonomicassignmentofhumanviruses AT bankennethhonkim multitaskcnnlearningmodelfortaxonomicassignmentofhumanviruses |
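
The abstract describes ranking candidate viruses by combining the proportion of reads assigned to each taxon with the genomic coverage estimated from per-read region assignments. The sketch below is only a toy illustration of that idea and is not the authors' model: it replaces the paper's naïve Bayesian network with a plain product of read proportion and coverage fraction, normalized across candidates. The function name `rank_candidates`, the input layout, and the default of 10 genomic regions are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def rank_candidates(assignments, n_regions=10):
    """Rank candidate viruses from per-read (virus, region) assignments.

    assignments: iterable of (virus_label, region_index) pairs, one per read.
    n_regions:   number of regions each genome is split into (assumed value).

    Score = read proportion * fraction of regions covered, normalized to sum
    to 1. This is a simplified stand-in, not the published Bayesian network.
    """
    read_counts = Counter()
    regions_hit = defaultdict(set)
    for virus, region in assignments:
        read_counts[virus] += 1
        regions_hit[virus].add(region)

    total_reads = sum(read_counts.values())
    if total_reads == 0:
        return []

    raw_scores = {}
    for virus, count in read_counts.items():
        proportion = count / total_reads
        coverage = len(regions_hit[virus]) / n_regions
        raw_scores[virus] = proportion * coverage

    norm = sum(raw_scores.values())
    ranked = [(virus, score / norm) for virus, score in raw_scores.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical input: virus "A" has more reads, but all fall in one region;
# virus "B" has fewer reads spread over four regions.
reads = [("A", 0)] * 6 + [("B", 0), ("B", 1), ("B", 2), ("B", 3)]
for virus, score in rank_candidates(reads):
    print(f"{virus}: {score:.3f}")
```

In this made-up input, the candidate with fewer but more widely distributed reads ranks first, which mirrors the abstract's point that read counts alone may be misleading without genomic coverage.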