A machine learning approach for viral genome classification
BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification...
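Main authors: | Remita, Mohamed Amine; Halioui, Ahmed; Malick Diouara, Abou Abdallah; Daigle, Bruno; Kiani, Golrokh; Diallo, Abdoulaye Baniré |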
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5387389/ https://www.ncbi.nlm.nih.gov/pubmed/28399797 http://dx.doi.org/10.1186/s12859-017-1602-3 |
_version_ | 1782520938403725312 |
---|---|
author | Remita, Mohamed Amine; Halioui, Ahmed; Malick Diouara, Abou Abdallah; Daigle, Bruno; Kiani, Golrokh; Diallo, Abdoulaye Baniré |
author_facet | Remita, Mohamed Amine; Halioui, Ahmed; Malick Diouara, Abou Abdallah; Daigle, Bruno; Kiani, Golrokh; Diallo, Abdoulaye Baniré |
author_sort | Remita, Mohamed Amine |
collection | PubMed |
description | BACKGROUND: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied families of viruses. Thus, viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families. RESULTS: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows competitive performance compared to well-known HIV-1-specific classifiers (REGA and COMET) on whole genomes and pol fragments. CONCLUSION: CASTOR's performance, genericity and robustness could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1602-3) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5387389 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5387389 2017-04-14 A machine learning approach for viral genome classification Remita, Mohamed Amine; Halioui, Ahmed; Malick Diouara, Abou Abdallah; Daigle, Bruno; Kiani, Golrokh; Diallo, Abdoulaye Baniré BMC Bioinformatics Research Article BioMed Central 2017-04-11 /pmc/articles/PMC5387389/ /pubmed/28399797 http://dx.doi.org/10.1186/s12859-017-1602-3 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Remita, Mohamed Amine Halioui, Ahmed Malick Diouara, Abou Abdallah Daigle, Bruno Kiani, Golrokh Diallo, Abdoulaye Baniré A machine learning approach for viral genome classification |
title | A machine learning approach for viral genome classification |
title_full | A machine learning approach for viral genome classification |
title_fullStr | A machine learning approach for viral genome classification |
title_full_unstemmed | A machine learning approach for viral genome classification |
title_short | A machine learning approach for viral genome classification |
title_sort | machine learning approach for viral genome classification |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5387389/ https://www.ncbi.nlm.nih.gov/pubmed/28399797 http://dx.doi.org/10.1186/s12859-017-1602-3 |
work_keys_str_mv | AT remitamohamedamine amachinelearningapproachforviralgenomeclassification AT haliouiahmed amachinelearningapproachforviralgenomeclassification AT malickdiouaraabouabdallah amachinelearningapproachforviralgenomeclassification AT daiglebruno amachinelearningapproachforviralgenomeclassification AT kianigolrokh amachinelearningapproachforviralgenomeclassification AT dialloabdoulayebanire amachinelearningapproachforviralgenomeclassification AT remitamohamedamine machinelearningapproachforviralgenomeclassification AT haliouiahmed machinelearningapproachforviralgenomeclassification AT malickdiouaraabouabdallah machinelearningapproachforviralgenomeclassification AT daiglebruno machinelearningapproachforviralgenomeclassification AT kianigolrokh machinelearningapproachforviralgenomeclassification AT dialloabdoulayebanire machinelearningapproachforviralgenomeclassification |
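
The description field in this record says that CASTOR simulates restriction digestion in silico and builds feature vectors from the resulting fragments. The following is a minimal sketch of that idea, not CASTOR's actual implementation. The record does not name the two metrics CASTOR uses, so the fragment count and mean fragment length here are stand-in assumptions, as are the three-enzyme panel, the function names `digest` and `feature_vector`, and the choice to treat every genome as linear.

```python
import re
from statistics import mean

# Recognition site and cut offset (bases from the start of the site)
# for three widely used enzymes. The panel is an illustrative assumption;
# the record does not list the enzymes CASTOR uses.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
}


def digest(sequence, site, cut_offset):
    """Return the fragment lengths from cutting a linear sequence at every site."""
    seq = sequence.upper()
    # A lookahead finds every occurrence, including overlapping ones.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def feature_vector(sequence):
    """Build [fragment count, mean fragment length] per enzyme, in name order.

    Both metrics are hypothetical placeholders for the two metrics
    mentioned in the abstract.
    """
    features = []
    for name in sorted(ENZYMES):
        site, offset = ENZYMES[name]
        lengths = digest(sequence, site, offset)
        features.append(len(lengths))
        features.append(mean(lengths))
    return features


if __name__ == "__main__":
    toy_genome = "AAGAATTCTTGGATCCAA"  # made-up sequence, not real viral data
    print(feature_vector(toy_genome))
```

Vectors like these, computed for genomes with known labels, could then be passed to any standard classifier. That step is omitted here because the record does not say which algorithms CASTOR uses.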