
Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities


Bibliographic Details
Main Authors: Roshan, Usman; Chikkagoudar, Satish; Livesay, Dennis R
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2248559/
https://www.ncbi.nlm.nih.gov/pubmed/18226231
http://dx.doi.org/10.1186/1471-2105-9-61
author Roshan, Usman
Chikkagoudar, Satish
Livesay, Dennis R
collection PubMed
description BACKGROUND: Identifying RNA homologs within genomic stretches is difficult when pairwise sequence identity is low or when unalignable flanking residues are present. In both cases, structure-sequence and profile/family-sequence alignment programs become difficult to apply because the RNA structures or family alignments are unreliable. As a result, local sequence-sequence alignment programs are frequently used instead. We have recently demonstrated that maximal expected accuracy alignments using partition function match probabilities (implemented in Probalign) are significantly better than contemporary methods on heterogeneous-length protein sequence datasets, which suggests an affinity for local alignment. RESULTS: We create a pairwise RNA-genome alignment benchmark from RFAM families with average pairwise sequence identity up to 60%. Each dataset contains a query RNA aligned to a target RNA (of the same family) embedded in a genomic sequence at least 5K nucleotides long. To simulate common conditions in which the exact ends of an ncRNA are unknown, each query RNA has 5' and 3' genomic flanks of size 50, 100, and 150 nucleotides. We then compare the error of the Probalign program (adjusted for local alignment) to that of the commonly used local alignment programs HMMER, SSEARCH, and BLAST, and of the popular ClustalW program with zero end-gap penalties. Parameters were optimized for each program on a small subset of the benchmark. Probalign has the highest overall accuracy on the full benchmark. It leads by 10% accuracy over SSEARCH (the next best method) on 5 out of 22 families. On datasets restricted to a maximum of 30% sequence identity, Probalign's overall median error is 71.2% vs. 83.4% for SSEARCH (P-value < 0.05). Furthermore, on these datasets Probalign leads SSEARCH by at least 10% on five families; SSEARCH leads Probalign by the same margin on two of the fourteen families. We also demonstrate that the Probalign mean posterior probability is a better discriminator of alignment quality than the normalized SSEARCH Z-score. All datasets and software are available online. CONCLUSION: We demonstrate, for the first time, that partition function match probabilities used for expected accuracy alignment, as done in Probalign, provide a statistically significant improvement over current approaches for identifying distantly related RNA sequences in larger genomic segments.
format Text
id pubmed-2248559
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2248559 2008-02-22 Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities. Roshan, Usman; Chikkagoudar, Satish; Livesay, Dennis R. BMC Bioinformatics. Research Article. BioMed Central 2008-01-28 /pmc/articles/PMC2248559/ /pubmed/18226231 http://dx.doi.org/10.1186/1471-2105-9-61 Text en Copyright © 2008 Roshan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2248559/
https://www.ncbi.nlm.nih.gov/pubmed/18226231
http://dx.doi.org/10.1186/1471-2105-9-61
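The abstract refers to three ideas: partition function match probabilities, maximal expected accuracy alignment, and the mean posterior probability as a quality score. The sketch below illustrates them in a deliberately simplified setting and is not the Probalign implementation. Everything specific in it is an assumption made for illustration: global rather than local alignment, a linear gap penalty, the scoring values (match, mismatch, gap), the temperature T, and the function names.

```python
import math


def match_posteriors(x, y, match=1.0, mismatch=-1.0, gap=-2.0, T=1.0):
    """Posterior probability that x[i] is aligned to y[j].

    Assumed toy model: global alignment, linear gap penalty, and
    Boltzmann weights exp(score / T). All parameter values are illustrative.
    """
    n, m = len(x), len(y)
    g = math.exp(gap / T)
    w = [[math.exp((match if a == b else mismatch) / T) for b in y] for a in x]

    # Forward partition function over prefixes x[:i], y[:j].
    zf = [[0.0] * (m + 1) for _ in range(n + 1)]
    zf[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                zf[i][j] += zf[i - 1][j] * g
            if j > 0:
                zf[i][j] += zf[i][j - 1] * g
            if i > 0 and j > 0:
                zf[i][j] += zf[i - 1][j - 1] * w[i - 1][j - 1]

    # Backward partition function over suffixes x[i:], y[j:].
    zb = [[0.0] * (m + 1) for _ in range(n + 1)]
    zb[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n:
                zb[i][j] += zb[i + 1][j] * g
            if j < m:
                zb[i][j] += zb[i][j + 1] * g
            if i < n and j < m:
                zb[i][j] += zb[i + 1][j + 1] * w[i][j]

    total = zf[n][m]
    # Weight of all alignments that pair x[i] with y[j], divided by the total.
    return [[zf[i][j] * w[i][j] * zb[i + 1][j + 1] / total for j in range(m)]
            for i in range(n)]


def mea_alignment(p):
    """Aligned index pairs that maximize the sum of posterior probabilities."""
    n = len(p)
    m = len(p[0]) if n else 0
    a = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(a[i - 1][j - 1] + p[i - 1][j - 1],
                          a[i - 1][j], a[i][j - 1])
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i][j] == a[i - 1][j - 1] + p[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif a[i][j] == a[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def mean_posterior(p, pairs):
    """Average posterior over the aligned pairs (0.0 if nothing is aligned)."""
    return sum(p[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    query, target = "GGCUAGCC", "GGCAAGCC"  # made-up sequences
    post = match_posteriors(query, target)
    aligned = mea_alignment(post)
    print(aligned, round(mean_posterior(post, aligned), 3))
```

In this toy model a higher mean posterior indicates that the chosen pairs are supported by most of the alignment ensemble, which is the intuition behind using it as a quality score. How Probalign itself normalizes this score, handles affine gaps, and adapts to local alignment is not specified in this record.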