
Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae

BACKGROUND: Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. The studies of the diversity of species, taxonomy, mechanisms of evolution, geographic distribution, and mechanisms of interaction of these pathogens with the host h...


Bibliographic Details
Main authors: Silva, José Cleydson F., Carvalho, Thales F. M., Fontes, Elizabeth P. B., Cerqueira, Fabio R.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5622471/
https://www.ncbi.nlm.nih.gov/pubmed/28964254
http://dx.doi.org/10.1186/s12859-017-1839-x
author Silva, José Cleydson F.
Carvalho, Thales F. M.
Fontes, Elizabeth P. B.
Cerqueira, Fabio R.
collection PubMed
description BACKGROUND: Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. Studies of the diversity of species, taxonomy, mechanisms of evolution, geographic distribution, and mechanisms of interaction of these pathogens with the host have increased greatly in recent years. Furthermore, the use of rolling circle amplification (RCA) and advanced metagenomics approaches has enabled the elucidation of viromes and the identification of many viral agents in a large number of plant species. As a result, determining the nomenclature of geminiviruses and classifying them taxonomically have become complex tasks. In addition, the gene responsible for viral replication (particularly in viruses belonging to the genus Mastrevirus) may be spliced, because the virus uses the transcriptional/splicing machinery of the host cells. However, current tools are limited in their ability to identify introns. RESULTS: This study proposes a new method, designated Fangorn Forest (F2), based on machine learning approaches to classify genera using an ab initio approach, i.e., using only the genomic sequence, as well as to predict and classify genes in the family Geminiviridae. In this investigation, nine genera of the family Geminiviridae and their related satellite DNAs were selected. We built two training sets: one for genus classification, containing attributes extracted from the complete genomes of geminiviruses, and one for gene classification, containing attributes extracted from ORFs taken from those same genomes. Three machine learning (ML) algorithms were applied to those datasets to build the predictive models: support vector machines trained with sequential minimal optimization, random forest (RF), and multilayer perceptron.
RF showed very high predictive power, achieving precision, recall, and area under the curve (AUC) of 0.966, 0.964, and 0.995, respectively, for genus classification. For gene classification, RF reached precision, recall, and AUC of 0.983, 0.983, and 0.998, respectively. CONCLUSIONS: Fangorn Forest therefore proved to be an efficient method for classifying genera of the family Geminiviridae with high precision, and for effective gene prediction and classification. The method is freely accessible at www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-017-1839-x) contains supplementary material, which is available to authorized users.
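Note: the description above mentions a training set built from attributes extracted from ORFs of complete genomes, but the record does not list those attributes. The following is only a minimal sketch of that kind of step, not the authors' implementation. It assumes the standard stop codons, treats the genome as linear (ORFs that wrap around a circular genome are not handled), and uses ORF length and GC content as placeholder attributes. The function names `find_orfs`, `orf_attributes`, and `genome_orf_attributes` are hypothetical.

```python
from typing import Dict, List

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30) -> List[str]:
    """Return ATG-to-stop ORFs (stop codon included) in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) >= min_len:
                    orfs.append(orf)
                start = None
    return orfs


def orf_attributes(orf: str) -> Dict[str, object]:
    """Placeholder attributes (illustrative only): ORF length and GC content."""
    gc = (orf.count("G") + orf.count("C")) / len(orf)
    return {"length": len(orf), "gc_content": gc}


def genome_orf_attributes(genome: str, min_len: int = 30) -> List[Dict[str, object]]:
    """Build one attribute row per ORF found on either strand of the genome."""
    genome = genome.upper()
    rows = []
    for strand, s in (("+", genome), ("-", reverse_complement(genome))):
        for orf in find_orfs(s, min_len):
            row = orf_attributes(orf)
            row["strand"] = strand
            rows.append(row)
    return rows


if __name__ == "__main__":
    toy = "CCATGAAACCCGGGTTTTAGCC"  # toy sequence, not a real geminivirus genome
    for row in genome_orf_attributes(toy, min_len=9):
        print(row)
```

Rows of this shape could serve as input to any classifier, such as the random forest named in the description. A real pipeline would use a much richer attribute set and would need to account for the circular genome.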
format Online
Article
Text
id pubmed-5622471
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5622471 2017-10-11 BMC Bioinformatics Methodology Article BioMed Central 2017-09-30 /pmc/articles/PMC5622471/ /pubmed/28964254 http://dx.doi.org/10.1186/s12859-017-1839-x Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5622471/
https://www.ncbi.nlm.nih.gov/pubmed/28964254
http://dx.doi.org/10.1186/s12859-017-1839-x