A multi-task CNN learning model for taxonomic assignment of human viruses

BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the refe...

Bibliographic Details
Main authors: Ma, Haoran; Tan, Tin Wee; Ban, Kenneth Hon Kim
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8170063/
https://www.ncbi.nlm.nih.gov/pubmed/34078269
http://dx.doi.org/10.1186/s12859-021-04084-w
author Ma, Haoran
Tan, Tin Wee
Ban, Kenneth Hon Kim
collection PubMed
description BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of the overall likelihood of a correct taxonomic assignment for a sample. RESULTS: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on a convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2, and the region assignments were used to estimate genomic coverage, which was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data. CONCLUSIONS: We have developed a pipeline that combines a novel MT-CNN model, which is able to identify viruses with divergent sequences and assign the genomic region, with a Bayesian approach to ranking taxonomic assignments that takes into account both the number of assigned reads and genomic coverage. The pipeline is available at GitHub via https://github.com/MaHaoran627/CNN_Virus. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04084-w.
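The abstract says candidate viruses are ranked using both the proportion of reads assigned to each virus and the genomic coverage estimated from per-read region assignments. The sketch below illustrates that general idea only and is not the authors' implementation: the function name `rank_candidates`, the number of genome regions (`n_regions=10`), the virus labels, and the scoring rule (a plain product of the two quantities, standing in for the naïve Bayesian network) are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_candidates(assignments, n_regions=10):
    """Rank candidate viruses from per-read (virus, region) assignments.

    Hypothetical scoring: read proportion multiplied by the fraction of
    genome regions that received at least one read. This is a simplified
    stand-in for the paper's naive Bayesian network, not a reproduction.
    """
    read_counts = defaultdict(int)
    regions_hit = defaultdict(set)
    for virus, region in assignments:
        read_counts[virus] += 1
        regions_hit[virus].add(region)

    total_reads = len(assignments)
    scores = {}
    for virus, count in read_counts.items():
        proportion = count / total_reads                 # share of all assigned reads
        coverage = len(regions_hit[virus]) / n_regions   # share of regions covered
        scores[virus] = proportion * coverage

    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical input: virus_B has more reads, but they all pile up in one
# region, while virus_A's reads are spread over four regions.
example_reads = (
    [("virus_A", region) for region in (0, 1, 2, 3)]
    + [("virus_B", 5)] * 6
)
ranking = rank_candidates(example_reads, n_regions=10)
```

In this toy input, the candidate with fewer reads but broader coverage ranks first, which is the kind of behaviour that motivates combining read counts with coverage rather than relying on read counts alone.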
format Online
Article
Text
id pubmed-8170063
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8170063 2021-06-02 A multi-task CNN learning model for taxonomic assignment of human viruses. Ma, Haoran; Tan, Tin Wee; Ban, Kenneth Hon Kim. BMC Bioinformatics. Software. BioMed Central 2021-06-02 /pmc/articles/PMC8170063/ /pubmed/34078269 http://dx.doi.org/10.1186/s12859-021-04084-w Text en
© The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A multi-task CNN learning model for taxonomic assignment of human viruses
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8170063/
https://www.ncbi.nlm.nih.gov/pubmed/34078269
http://dx.doi.org/10.1186/s12859-021-04084-w