
SPA: A Probabilistic Algorithm for Spliced Alignment

Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it...


Bibliographic Details
Main authors: van Nimwegen, Erik; Paul, Nicodeme; Sheridan, Robert; Zavolan, Mihaela
Format: Text
Language: English
Published: Public Library of Science 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1449883/
https://www.ncbi.nlm.nih.gov/pubmed/16683023
http://dx.doi.org/10.1371/journal.pgen.0020024
author van Nimwegen, Erik
Paul, Nicodeme
Sheridan, Robert
Zavolan, Mihaela
collection PubMed
description Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5′ and 3′ ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. 
Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personal/nimwegen/cgi-bin/spa.cgi.
format Text
id pubmed-1449883
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-1449883 2006-05-08 PLoS Genet Research Article Public Library of Science 2006-04 2006-04-28 /pmc/articles/PMC1449883/ /pubmed/16683023 http://dx.doi.org/10.1371/journal.pgen.0020024 Text en © 2006 van Nimwegen et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title SPA: A Probabilistic Algorithm for Spliced Alignment
topic Research Article