Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities

BACKGROUND: Identification of RNA homologs within genomic stretches is difficult when pairwise sequence identity is low or unalignable flanking residues are present. In both cases structure-sequence or profile/family-sequence alignment programs become difficult to apply because of unreliable RNA str...

Bibliographic Details
Main Authors:	Roshan, Usman; Chikkagoudar, Satish; Livesay, Dennis R
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2248559/ https://www.ncbi.nlm.nih.gov/pubmed/18226231 http://dx.doi.org/10.1186/1471-2105-9-61

author	Roshan, Usman; Chikkagoudar, Satish; Livesay, Dennis R
collection	PubMed
description	BACKGROUND: Identification of RNA homologs within genomic stretches is difficult when pairwise sequence identity is low or unalignable flanking residues are present. In both cases, structure-sequence or profile/family-sequence alignment programs become difficult to apply because of unreliable RNA structures or family alignments. As such, local sequence-sequence alignment programs are frequently used instead. We have recently demonstrated that maximal expected accuracy alignments using partition function match probabilities (implemented in Probalign) are significantly better than contemporary methods on heterogeneous length protein sequence datasets, thus suggesting an affinity for local alignment. RESULTS: We create a pairwise RNA-genome alignment benchmark from RFAM families with average pairwise sequence identity up to 60%. Each dataset contains a query RNA aligned to a target RNA (of the same family) embedded in a genomic sequence at least 5K nucleotides long. To simulate common conditions when exact ends of an ncRNA are unknown, each query RNA has 5' and 3' genomic flanks of size 50, 100, and 150 nucleotides. We subsequently compare the error of the Probalign program (adjusted for local alignment) to the commonly used local alignment programs HMMER, SSEARCH, and BLAST, and the popular ClustalW program with zero end-gap penalties. Parameters were optimized for each program on a small subset of the benchmark. Probalign has the highest overall accuracy on the full benchmark. It leads by 10% accuracy over SSEARCH (the next best method) on 5 out of 22 families. On datasets restricted to a maximum of 30% sequence identity, Probalign's overall median error is 71.2% vs. 83.4% for SSEARCH (P-value < 0.05). Furthermore, on these datasets Probalign leads SSEARCH by at least 10% on five families; SSEARCH leads Probalign by the same margin on two of the fourteen families. We also demonstrate that the Probalign mean posterior probability, compared to the normalized SSEARCH Z-score, is a better discriminator of alignment quality. All datasets and software are available online. CONCLUSION: We demonstrate, for the first time, that partition function match probabilities used for expected accuracy alignment, as done in Probalign, provide statistically significant improvement over current approaches for identifying distantly related RNA sequences in larger genomic segments.
format	Text
id	pubmed-2248559
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2248559 2008-02-22 Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities Roshan, Usman; Chikkagoudar, Satish; Livesay, Dennis R BMC Bioinformatics Research Article BioMed Central 2008-01-28 /pmc/articles/PMC2248559/ /pubmed/18226231 http://dx.doi.org/10.1186/1471-2105-9-61 Text en Copyright © 2008 Roshan et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2248559/ https://www.ncbi.nlm.nih.gov/pubmed/18226231 http://dx.doi.org/10.1186/1471-2105-9-61
