
A machine learning approach for viral genome classification

BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification...


Bibliographic Details
Main authors: Remita, Mohamed Amine, Halioui, Ahmed, Malick Diouara, Abou Abdallah, Daigle, Bruno, Kiani, Golrokh, Diallo, Abdoulaye Baniré
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5387389/
https://www.ncbi.nlm.nih.gov/pubmed/28399797
http://dx.doi.org/10.1186/s12859-017-1602-3
author Remita, Mohamed Amine
Halioui, Ahmed
Malick Diouara, Abou Abdallah
Daigle, Bruno
Kiani, Golrokh
Diallo, Abdoulaye Baniré
collection PubMed
description BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied virus families. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families. RESULTS: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows competitive performance compared to well-known HIV-1 specific classifiers (REGA and COMET) on whole genomes and pol fragments. CONCLUSION: CASTOR's performance, genericity and robustness could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1602-3) contains supplementary material, which is available to authorized users.
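The in silico digestion step described in the abstract can be illustrated with a minimal Python sketch. It is not CASTOR's implementation. The abstract does not say which enzymes or which two metrics CASTOR uses, so the three enzymes below and the two per-enzyme metrics (fragment count and mean fragment length) are assumptions chosen for illustration, and the function names `digest` and `feature_vector` are hypothetical. Sequences are treated as linear and only the given strand is scanned.

```python
# Recognition site and cut offset for three common restriction enzymes.
# Each of them cuts after the first base of its site (e.g. EcoRI: G^AATTC).
# The choice of enzymes is an assumption; the abstract does not list them.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def digest(sequence, site, offset):
    """Cut a linear sequence at every occurrence of `site`; return the fragments."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def feature_vector(sequence):
    """Build [fragment count, mean fragment length] for each enzyme, in order.

    These two metrics are illustrative stand-ins, not the ones used by CASTOR.
    """
    sequence = sequence.upper()
    features = []
    for site, offset in ENZYMES.values():
        lengths = [len(f) for f in digest(sequence, site, offset)]
        features.append(len(lengths))
        features.append(sum(lengths) / len(lengths))
    return features


if __name__ == "__main__":
    toy_genome = "AAGAATTCTTTTGAATTCGG"  # made-up sequence with two EcoRI sites
    print(digest(toy_genome, "GAATTC", 1))
    print(feature_vector(toy_genome))
```

Vectors of this kind, one per genome, would then serve as the input rows for a supervised classifier in the classification step.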
format Online
Article
Text
id pubmed-5387389
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5387389 2017-04-14 A machine learning approach for viral genome classification Remita, Mohamed Amine Halioui, Ahmed Malick Diouara, Abou Abdallah Daigle, Bruno Kiani, Golrokh Diallo, Abdoulaye Baniré BMC Bioinformatics Research Article BioMed Central 2017-04-11 /pmc/articles/PMC5387389/ /pubmed/28399797 http://dx.doi.org/10.1186/s12859-017-1602-3 Text en
© The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A machine learning approach for viral genome classification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5387389/
https://www.ncbi.nlm.nih.gov/pubmed/28399797
http://dx.doi.org/10.1186/s12859-017-1602-3