
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study


Bibliographic Details
Main authors: Randhawa, Gurjit S., Soltysiak, Maximillian P. M., El Roz, Hadi, de Souza, Camila P. E., Hill, Kathleen A., Kari, Lila
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7182198/
https://www.ncbi.nlm.nih.gov/pubmed/32330208
http://dx.doi.org/10.1371/journal.pone.0232391
_version_ 1783526197986066432
author Randhawa, Gurjit S.
Soltysiak, Maximillian P. M.
El Roz, Hadi
de Souza, Camila P. E.
Hill, Kathleen A.
Kari, Lila
collection PubMed
description The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
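The description mentions an alignment-free comparison of raw sequences and a Spearman's rank correlation analysis for validation. The following is only a minimal sketch of that general idea, not the authors' MLDSP implementation: it assumes k-mer count vectors as the sequence representation, an arbitrary k, and hypothetical helper names (`kmer_counts`, `average_ranks`, `spearman`) and toy sequences chosen for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3):
    """Count overlapping k-mers over the alphabet ACGT, in a fixed order.

    Assumption: k-mers containing other symbols (e.g. N) are simply ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy sequences, not real viral genomes.
    a = "ACGTACGTTAGCCGATAGCTAGCTAGGATCC"
    b = "ACGTACGATAGCCGTTAGCTAGCAAGGATCC"
    print(spearman(kmer_counts(a), kmer_counts(b)))
```

A higher coefficient between two k-mer count vectors would indicate more similar k-mer usage; how the paper turns such comparisons into a validation step is not specified in this record.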
format Online
Article
Text
id pubmed-7182198
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7182198 2020-05-05 Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study Randhawa, Gurjit S. Soltysiak, Maximillian P. M. El Roz, Hadi de Souza, Camila P. E. Hill, Kathleen A. Kari, Lila PLoS One Research Article
Public Library of Science 2020-04-24 /pmc/articles/PMC7182198/ /pubmed/32330208 http://dx.doi.org/10.1371/journal.pone.0232391 Text en © 2020 Randhawa et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
topic Research Article