Towards realistic benchmarks for multiple alignments of non-coding sequences

BACKGROUND: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequ...

Bibliographic Details
Main Authors:	Kim, Jaebum; Sinha, Saurabh
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2823711/ https://www.ncbi.nlm.nih.gov/pubmed/20102627 http://dx.doi.org/10.1186/1471-2105-11-54

author	Kim, Jaebum Sinha, Saurabh
collection	PubMed
description	BACKGROUND: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.
RESULTS: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.
CONCLUSION: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
format	Text
id	pubmed-2823711
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2823711 2010-02-18 Towards realistic benchmarks for multiple alignments of non-coding sequences Kim, Jaebum Sinha, Saurabh BMC Bioinformatics Research article BioMed Central 2010-01-26 /pmc/articles/PMC2823711/ /pubmed/20102627 http://dx.doi.org/10.1186/1471-2105-11-54 Text en Copyright ©2010 Kim and Sinha; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Towards realistic benchmarks for multiple alignments of non-coding sequences
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2823711/ https://www.ncbi.nlm.nih.gov/pubmed/20102627 http://dx.doi.org/10.1186/1471-2105-11-54