Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses
High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically...
Main Authors: | Alam, Md. Nafis Ul; Chowdhury, Umar Faruq |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7500682/ https://www.ncbi.nlm.nih.gov/pubmed/32946529 http://dx.doi.org/10.1371/journal.pone.0239381 |
_version_ | 1783583903975473152 |
---|---|
author | Alam, Md. Nafis Ul Chowdhury, Umar Faruq |
author_facet | Alam, Md. Nafis Ul Chowdhury, Umar Faruq |
author_sort | Alam, Md. Nafis Ul |
collection | PubMed |
description | High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data. |
format | Online Article Text |
id | pubmed-7500682 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-7500682 2020-09-24 Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses Alam, Md. Nafis Ul Chowdhury, Umar Faruq PLoS One Research Article High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and owing to the intrinsic genomic complexity of viruses, RNA viruses have gone completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data. 
Public Library of Science 2020-09-18 /pmc/articles/PMC7500682/ /pubmed/32946529 http://dx.doi.org/10.1371/journal.pone.0239381 Text en © 2020 Alam, Chowdhury http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Alam, Md. Nafis Ul Chowdhury, Umar Faruq Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses |
title | Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses |
title_full | Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses |
title_fullStr | Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses |
title_full_unstemmed | Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses |
title_short | Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses |
title_sort | short k-mer abundance profiles yield robust machine learning features and accurate classifiers for rna viruses |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7500682/ https://www.ncbi.nlm.nih.gov/pubmed/32946529 http://dx.doi.org/10.1371/journal.pone.0239381 |
work_keys_str_mv | AT alammdnafisul shortkmerabundanceprofilesyieldrobustmachinelearningfeaturesandaccurateclassifiersforrnaviruses AT chowdhuryumarfaruq shortkmerabundanceprofilesyieldrobustmachinelearningfeaturesandaccurateclassifiersforrnaviruses |
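
The description above refers to short k-mer abundance profiles as machine learning features. As a rough illustration only, the sketch below computes a generic relative-abundance vector over all k-mers of an RNA sequence. It is not the authors' scoring method, which this record does not specify; the function name `kmer_profile`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Return the relative abundance of every k-mer over the alphabet ACGU.

    Hypothetical helper for illustration; not taken from the article.
    """
    # Normalise case and treat DNA-style input as RNA (T -> U).
    seq = seq.upper().replace("T", "U")

    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Enumerate all 4**k possible k-mers so the vector has a fixed layout.
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]

    # Windows containing other symbols (e.g. N) are ignored (assumption).
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


if __name__ == "__main__":
    profile = kmer_profile("ACGUACGU", k=2)
    for kmer, freq in profile.items():
        if freq > 0:
            print(kmer, round(freq, 3))
```

A fixed-length vector like this could be passed to any standard classifier; which k values and which normalisation the article actually used would need to be checked against the full text.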