Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses

High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically...

Bibliographic Details
Main Authors:	Alam, Md. Nafis Ul; Chowdhury, Umar Faruq
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7500682/ https://www.ncbi.nlm.nih.gov/pubmed/32946529 http://dx.doi.org/10.1371/journal.pone.0239381

author	Alam, Md. Nafis Ul; Chowdhury, Umar Faruq
collection	PubMed
description	High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer-based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
format	Online Article Text
id	pubmed-7500682
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-7500682 2020-09-24 Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses. Alam, Md. Nafis Ul; Chowdhury, Umar Faruq. PLoS One. Research Article. Public Library of Science 2020-09-18 /pmc/articles/PMC7500682/ /pubmed/32946529 http://dx.doi.org/10.1371/journal.pone.0239381 Text en © 2020 Alam, Chowdhury http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses
topic	Research Article
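
The abstract in this record describes using short k-mer abundance profiles as machine learning features. The record does not give the authors' scoring formula, so the following is only a minimal sketch of the general idea: it turns one sequence into a fixed-length vector of relative k-mer frequencies. The function name `kmer_profile`, the default `k=3`, the RNA alphabet `ACGU`, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration, not details taken from the article.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3, alphabet="ACGU"):
    """Return the relative abundance of every possible k-mer in seq.

    Hypothetical helper for illustration; not the published scoring method.
    """
    # Normalise case and treat DNA-style T as RNA U.
    seq = seq.upper().replace("T", "U")
    # Fixed feature order: all len(alphabet)**k k-mers in lexicographic order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows with symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    profile = kmer_profile("AUGAUG", k=2)
    print(len(profile), sum(profile))
```

Vectors of this kind have the same length for every sequence, so they could be passed to any standard classifier; which classifiers and which k values the authors actually used is not stated in this record.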

Similar Items