
The impact of sequence length and number of sequences on promoter prediction performance

BACKGROUND: Rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Given that one important step for functional genomic annotation is the promoter identificati...


Bibliographic Details
Main Authors:	Carvalho, Sávio G, Guerra-Sá, Renata, de C Merschmann, Luiz H
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686783/ https://www.ncbi.nlm.nih.gov/pubmed/26695879 http://dx.doi.org/10.1186/1471-2105-16-S19-S5

author	Carvalho, Sávio G Guerra-Sá, Renata de C Merschmann, Luiz H
author_sort	Carvalho, Sávio G
collection	PubMed
description	BACKGROUND: Rapid growth in genome sequencing capacity has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to predict promoters. Different classifiers and characteristics of promoter sequences have been used to address this prediction problem. However, several works in the literature have addressed promoter prediction using datasets containing sequences of 250 nucleotides or more. Because sequence length determines the number of dataset attributes, datasets with a large number of attributes are generated for training classifiers even when only a limited number of properties is used to characterize the sequences. Since high-dimensional datasets can degrade a classifier's predictive performance or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential to obtain good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically verified the relation between sequence length and the predictive performance of classifiers. Thus, in this work, we evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers. RESULTS: We built sixteen datasets composed of sequences of different sizes (ranging in length from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences varying in length from 301 to 41 nucleotides, while k-NN achieved its best performance for the dataset composed of sequences of 101 nucleotides. We also analyzed, using sequences of only 41 nucleotides, the impact of increasing the number of sequences in a dataset on the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more. CONCLUSION: The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences, and also required significantly less processing time. Furthermore, increasing the number of sequences in a dataset proved beneficial to the predictive power of the classifiers.
format	Online Article Text
id	pubmed-4686783
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4686783 2015-12-31 BMC Bioinformatics Research BioMed Central 2015-12-16 /pmc/articles/PMC4686783/ /pubmed/26695879 http://dx.doi.org/10.1186/1471-2105-16-S19-S5 Text en Copyright © 2015 Carvalho et al. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	The impact of sequence length and number of sequences on promoter prediction performance
title_sort	impact of sequence length and number of sequences on promoter prediction performance
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4686783/ https://www.ncbi.nlm.nih.gov/pubmed/26695879 http://dx.doi.org/10.1186/1471-2105-16-S19-S5
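
The abstract's observation that sequence length determines the number of dataset attributes can be illustrated with a minimal sketch. It assumes a one-hot encoding (four attributes per nucleotide position) and a window taken symmetrically around the center of each sequence; both choices are illustrative assumptions and are not stated in this record as the authors' method. The function names `center_window`, `one_hot` and `attribute_count` are hypothetical.

```python
from typing import List

NUCLEOTIDES = "ACGT"


def center_window(seq: str, length: int) -> str:
    """Return the central `length` characters of `seq` (assumed trimming rule)."""
    if length > len(seq):
        raise ValueError("window is longer than the sequence")
    start = (len(seq) - length) // 2
    return seq[start:start + length]


def one_hot(seq: str) -> List[int]:
    """Encode each position as four 0/1 attributes (assumed featurization).

    Characters outside A, C, G, T (for example N) become four zeros.
    """
    features: List[int] = []
    for base in seq.upper():
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def attribute_count(length: int, per_position: int = len(NUCLEOTIDES)) -> int:
    """Number of attributes produced for a sequence of the given length."""
    return length * per_position


if __name__ == "__main__":
    # Sequence lengths mentioned in the abstract.
    for length in (301, 101, 41, 12):
        print(length, "nucleotides ->", attribute_count(length), "attributes")
```

Under this assumed encoding the attribute count grows linearly with sequence length, which is why shortening sequences reduces dataset dimensionality and, in turn, training cost.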
