Geminivirus data warehouse: a database enriched with machine learning approaches

BACKGROUND: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine...

Bibliographic Details
Main authors: Silva, Jose Cleydson F., Carvalho, Thales F. M., Basso, Marcos F., Deguchi, Michihito, Pereira, Welison A., Sobrinho, Roberto R., Vidigal, Pedro M. P., Brustolini, Otávio J. B., Silva, Fabyano F., Dal-Bianco, Maximiller, Fontes, Renildes L. F., Santos, Anésia A., Zerbini, Francisco Murilo, Cerqueira, Fabio R., Fontes, Elizabeth P. B.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5420152/
https://www.ncbi.nlm.nih.gov/pubmed/28476106
http://dx.doi.org/10.1186/s12859-017-1646-4
_version_ 1783234357536751616
author Silva, Jose Cleydson F.
Carvalho, Thales F. M.
Basso, Marcos F.
Deguchi, Michihito
Pereira, Welison A.
Sobrinho, Roberto R.
Vidigal, Pedro M. P.
Brustolini, Otávio J. B.
Silva, Fabyano F.
Dal-Bianco, Maximiller
Fontes, Renildes L. F.
Santos, Anésia A.
Zerbini, Francisco Murilo
Cerqueira, Fabio R.
Fontes, Elizabeth P. B.
author_facet Silva, Jose Cleydson F.
Carvalho, Thales F. M.
Basso, Marcos F.
Deguchi, Michihito
Pereira, Welison A.
Sobrinho, Roberto R.
Vidigal, Pedro M. P.
Brustolini, Otávio J. B.
Silva, Fabyano F.
Dal-Bianco, Maximiller
Fontes, Renildes L. F.
Santos, Anésia A.
Zerbini, Francisco Murilo
Cerqueira, Fabio R.
Fontes, Elizabeth P. B.
author_sort Silva, Jose Cleydson F.
collection PubMed
description BACKGROUND: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics. RESULTS: Here, we describe the development of a data warehouse enriched with ML approaches, designated geminivirus.org. We implemented search modules, bioinformatics tools, and ML methods to retrieve high precision information, demarcate species, and create classifiers for genera and open reading frames (ORFs) of geminivirus genomes. CONCLUSIONS: The use of data mining techniques such as ETL (Extract, Transform, Load) to feed our database, as well as algorithms based on machine learning for knowledge extraction, allowed us to obtain a database with quality data and suitable tools for bioinformatics analysis. The Geminivirus Data Warehouse (geminivirus.org) offers a simple and user-friendly environment for information retrieval and knowledge discovery related to geminiviruses. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1646-4) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5420152
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-54201522017-05-08 Geminivirus data warehouse: a database enriched with machine learning approaches Silva, Jose Cleydson F. Carvalho, Thales F. M. Basso, Marcos F. Deguchi, Michihito Pereira, Welison A. Sobrinho, Roberto R. Vidigal, Pedro M. P. Brustolini, Otávio J. B. Silva, Fabyano F. Dal-Bianco, Maximiller Fontes, Renildes L. F. Santos, Anésia A. Zerbini, Francisco Murilo Cerqueira, Fabio R. Fontes, Elizabeth P. B. BMC Bioinformatics Database BACKGROUND: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics. RESULTS: Here, we describe the development of a data warehouse enriched with ML approaches, designated geminivirus.org. We implemented search modules, bioinformatics tools, and ML methods to retrieve high precision information, demarcate species, and create classifiers for genera and open reading frames (ORFs) of geminivirus genomes. 
CONCLUSIONS: The use of data mining techniques such as ETL (Extract, Transform, Load) to feed our database, as well as algorithms based on machine learning for knowledge extraction, allowed us to obtain a database with quality data and suitable tools for bioinformatics analysis. The Geminivirus Data Warehouse (geminivirus.org) offers a simple and user-friendly environment for information retrieval and knowledge discovery related to geminiviruses. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1646-4) contains supplementary material, which is available to authorized users. BioMed Central 2017-05-05 /pmc/articles/PMC5420152/ /pubmed/28476106 http://dx.doi.org/10.1186/s12859-017-1646-4 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Database
Silva, Jose Cleydson F.
Carvalho, Thales F. M.
Basso, Marcos F.
Deguchi, Michihito
Pereira, Welison A.
Sobrinho, Roberto R.
Vidigal, Pedro M. P.
Brustolini, Otávio J. B.
Silva, Fabyano F.
Dal-Bianco, Maximiller
Fontes, Renildes L. F.
Santos, Anésia A.
Zerbini, Francisco Murilo
Cerqueira, Fabio R.
Fontes, Elizabeth P. B.
Geminivirus data warehouse: a database enriched with machine learning approaches
title Geminivirus data warehouse: a database enriched with machine learning approaches
title_full Geminivirus data warehouse: a database enriched with machine learning approaches
title_fullStr Geminivirus data warehouse: a database enriched with machine learning approaches
title_full_unstemmed Geminivirus data warehouse: a database enriched with machine learning approaches
title_short Geminivirus data warehouse: a database enriched with machine learning approaches
title_sort geminivirus data warehouse: a database enriched with machine learning approaches
topic Database
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5420152/
https://www.ncbi.nlm.nih.gov/pubmed/28476106
http://dx.doi.org/10.1186/s12859-017-1646-4