TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences

BACKGROUND: The prediction of secondary structure, i.e. the set of canonical base pairs between nucleotides, is a first step in developing an understanding of the function of an RNA sequence. The most accurate computational methods predict conserved structures for a set of homologous RNA sequences....

Bibliographic Details
Main Authors: Harmanci, Arif O, Sharma, Gaurav, Mathews, David H
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120699/
https://www.ncbi.nlm.nih.gov/pubmed/21507242
http://dx.doi.org/10.1186/1471-2105-12-108
_version_ 1782206737284071424
author Harmanci, Arif O
Sharma, Gaurav
Mathews, David H
author_facet Harmanci, Arif O
Sharma, Gaurav
Mathews, David H
author_sort Harmanci, Arif O
collection PubMed
description BACKGROUND: The prediction of secondary structure, i.e. the set of canonical base pairs between nucleotides, is a first step in developing an understanding of the function of an RNA sequence. The most accurate computational methods predict conserved structures for a set of homologous RNA sequences. These methods usually suffer from high computational complexity. In this paper, TurboFold, a novel and efficient method for secondary structure prediction for multiple RNA sequences, is presented. RESULTS: TurboFold takes, as input, a set of homologous RNA sequences and outputs estimates of the base pairing probabilities for each sequence. The base pairing probabilities for a sequence are estimated by combining intrinsic information, derived from the sequence itself via the nearest neighbor thermodynamic model, with extrinsic information, derived from the other sequences in the input set. For a given sequence, the extrinsic information is computed by using pairwise-sequence-alignment-based probabilities for co-incidence with each of the other sequences, along with estimated base pairing probabilities, from the previous iteration, for the other sequences. The extrinsic information is introduced as free energy modifications for base pairing in a partition function computation based on the nearest neighbor thermodynamic model. This process yields updated estimates of base pairing probability. The updated base pairing probabilities in turn are used to recompute extrinsic information, resulting in the overall iterative estimation procedure that defines TurboFold. TurboFold is benchmarked on a number of ncRNA datasets and compared against alternative secondary structure prediction methods. The iterative procedure in TurboFold is shown to improve estimates of base pairing probability with each iteration, though only small gains are obtained beyond three iterations. 
Secondary structures composed of base pairs with estimated probabilities higher than a significance threshold are shown to be more accurate for TurboFold than for alternative methods that estimate base pairing probabilities. TurboFold-MEA, which uses base pairing probabilities from TurboFold in a maximum expected accuracy algorithm for secondary structure prediction, has accuracy comparable to the best performing secondary structure prediction methods. The computational and memory requirements for TurboFold are modest and, in terms of sequence length and number of sequences, scale much more favorably than joint alignment and folding algorithms. CONCLUSIONS: TurboFold is an iterative probabilistic method for predicting secondary structures for multiple RNA sequences that efficiently and accurately combines the information from the comparative analysis between sequences with the thermodynamic folding model. Unlike most other multi-sequence structure prediction methods, TurboFold does not enforce strict commonality of structures and is therefore useful for predicting structures for homologous sequences that have diverged significantly. TurboFold can be downloaded as part of the RNAstructure package at http://rna.urmc.rochester.edu.
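The iterative loop described in the abstract can be illustrated with a toy sketch. It is not the TurboFold implementation: the function names, the `gamma` exponent, the pairwise `weights`, and the matrix form `A @ P @ A.T` for the extrinsic information are all assumptions made for illustration, and `toy_update` is a simple stand-in for the partition function computation under the nearest neighbor thermodynamic model, which is far too large to reproduce here. The sketch only shows the data flow: extrinsic information is computed from co-incidence probabilities and the previous iteration's pairing probabilities for the other sequences, then combined with fixed intrinsic information to give updated estimates.

```python
import numpy as np


def extrinsic_information(m, pair_probs, coincidence, weights):
    """Project the other sequences' pairing probabilities onto sequence m.

    pair_probs[n]       : (L_n, L_n) symmetric pairing probability estimates
    coincidence[(m, n)] : (L_m, L_n) hypothetical co-incidence probabilities
    weights[(m, n)]     : hypothetical scalar weight for the pair (m, n)
    """
    length = pair_probs[m].shape[0]
    extrinsic = np.zeros((length, length))
    for n, probs_n in enumerate(pair_probs):
        if n == m:
            continue  # only the *other* sequences contribute
        a = coincidence[(m, n)]
        # sum over (k, l) of a[i, k] * a[j, l] * probs_n[k, l]
        extrinsic += weights[(m, n)] * (a @ probs_n @ a.T)
    return extrinsic


def toy_update(intrinsic, extrinsic, gamma=0.3, eps=1e-9):
    """Stand-in for the partition function step (NOT the thermodynamic model).

    Pairs supported by extrinsic information are up-weighted, loosely
    mimicking a free energy bonus, then rescaled so no row sums above 1.
    """
    scores = intrinsic * (extrinsic + eps) ** gamma
    return scores / max(1.0, scores.sum(axis=1).max())


def iterate_pairing_estimates(intrinsic, coincidence, weights, iterations=3):
    """Alternate extrinsic computation and updates for a fixed iteration count."""
    pair_probs = [p.copy() for p in intrinsic]
    for _ in range(iterations):
        # All extrinsic terms use the previous iteration's estimates.
        extrinsic = [
            extrinsic_information(m, pair_probs, coincidence, weights)
            for m in range(len(pair_probs))
        ]
        pair_probs = [
            toy_update(intrinsic[m], extrinsic[m])
            for m in range(len(pair_probs))
        ]
    return pair_probs
```

The default of three iterations reflects the abstract's remark that only small gains are obtained beyond three iterations; every other numeric choice here is arbitrary.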
format Online
Article
Text
id pubmed-3120699
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3120699 2011-06-23 TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences Harmanci, Arif O Sharma, Gaurav Mathews, David H BMC Bioinformatics Research Article BioMed Central 2011-04-20 /pmc/articles/PMC3120699/ /pubmed/21507242 http://dx.doi.org/10.1186/1471-2105-12-108 Text en Copyright ©2011 Harmanci et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Harmanci, Arif O
Sharma, Gaurav
Mathews, David H
TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
title TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
title_full TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
title_fullStr TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
title_full_unstemmed TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
title_short TurboFold: Iterative probabilistic estimation of secondary structures for multiple RNA sequences
title_sort turbofold: iterative probabilistic estimation of secondary structures for multiple rna sequences
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120699/
https://www.ncbi.nlm.nih.gov/pubmed/21507242
http://dx.doi.org/10.1186/1471-2105-12-108
work_keys_str_mv AT harmanciarifo turbofolditerativeprobabilisticestimationofsecondarystructuresformultiplernasequences
AT sharmagaurav turbofolditerativeprobabilisticestimationofsecondarystructuresformultiplernasequences
AT mathewsdavidh turbofolditerativeprobabilisticestimationofsecondarystructuresformultiplernasequences