
Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses

High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically...


Bibliographic Details
Main authors: Alam, Md. Nafis Ul; Chowdhury, Umar Faruq
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7500682/
https://www.ncbi.nlm.nih.gov/pubmed/32946529
http://dx.doi.org/10.1371/journal.pone.0239381
author Alam, Md. Nafis Ul
Chowdhury, Umar Faruq
collection PubMed
description High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer-based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
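As an illustration of the idea of a short k-mer abundance profile mentioned in the description, the following is a minimal Python sketch. It is not the authors' scoring method, which this record does not specify. The function name `kmer_profile`, the default `k=3`, the RNA alphabet `ACGU`, and the choice to skip windows containing other symbols are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3, alphabet="ACGU"):
    """Relative abundance of every possible k-mer, in fixed lexicographic order.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Normalise case and treat DNA-style T as RNA U (an assumption).
    seq = seq.upper().replace("T", "U")
    # Enumerate all len(alphabet)**k k-mers so every sequence maps to a
    # vector of the same length, whatever its own length is.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    profile = kmer_profile("ACGUACGUAGCU", k=2)
    labels = ["".join(p) for p in product("ACGU", repeat=2)]
    # Print only the k-mers that actually occur in the sequence.
    for label, freq in zip(labels, profile):
        if freq > 0:
            print(label, round(freq, 3))
```

A fixed-length vector of this kind could then be passed as a feature row to any standard classifier. Normalising by the number of counted windows keeps profiles comparable between short contigs and long transcripts.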
format Online
Article
Text
id pubmed-7500682
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7500682 2020-09-24 Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses Alam, Md. Nafis Ul Chowdhury, Umar Faruq PLoS One Research Article High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer-based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
Public Library of Science 2020-09-18 /pmc/articles/PMC7500682/ /pubmed/32946529 http://dx.doi.org/10.1371/journal.pone.0239381 Text en © 2020 Alam, Chowdhury http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Alam, Md. Nafis Ul
Chowdhury, Umar Faruq
Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses
title Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7500682/
https://www.ncbi.nlm.nih.gov/pubmed/32946529
http://dx.doi.org/10.1371/journal.pone.0239381