Poly(A) motif prediction using spectral latent features from human DNA sequences

Bibliographic Details
Main authors: Xie, Bo; Jankovic, Boris R.; Bajic, Vladimir B.; Song, Le; Gao, Xin
Format: Online Article Text
Language: English
Published: Oxford University Press 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694652/
https://www.ncbi.nlm.nih.gov/pubmed/23813000
http://dx.doi.org/10.1093/bioinformatics/btt218
collection PubMed
description Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and a better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the nucleotide sequences surrounding candidate poly(A) motifs can, to a great extent, differentiate true motifs from false ones. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.
Results: We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance. We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14,740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26%, 15% and 35%, respectively. Meanwhile, our method makes ~30% fewer erroneous predictions than the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before.
Availability: http://sfb.kaust.edu.sa/Pages/Software.aspx
Contact: lsong@cc.gatech.edu or xin.gao@kaust.edu.sa
Supplementary information: Supplementary data are available at Bioinformatics online.
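The abstract describes classifying candidate poly(A) motifs from their surrounding nucleotide sequence. As a rough illustration of that input representation only, the Python sketch below locates candidate hexamers and counts k-mers in the flanking windows. It is not the authors' spectral hidden Markov model algorithm: plain k-mer counts are swapped in as a simple stand-in for the spectral latent features. The motif set (two well-known signal hexamers rather than the paper's 12 variants, which this record does not list), the flank width, the value of k, and all function names are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Illustrative motif set: two well-known poly(A) signal hexamers.
# The paper's dataset covers 12 variants that this record does not list.
CANDIDATE_MOTIFS = ("AATAAA", "ATTAAA")


def find_candidates(sequence, flank=10):
    """Yield (position, motif, upstream, downstream) for every motif hit
    that has at least `flank` nucleotides available on both sides."""
    sequence = sequence.upper()
    for motif in CANDIDATE_MOTIFS:
        start = sequence.find(motif)
        while start != -1:
            end = start + len(motif)
            if start >= flank and end + flank <= len(sequence):
                yield (start, motif,
                       sequence[start - flank:start],
                       sequence[end:end + flank])
            # Step by one so overlapping hits are not missed.
            start = sequence.find(motif, start + 1)


def kmer_counts(window, k=2):
    """Return a fixed-length vector of k-mer counts over the ACGT alphabet,
    ordered lexicographically (AA, AC, AG, AT, CA, ...)."""
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    return [counts["".join(kmer)] for kmer in product("ACGT", repeat=k)]


def featurize(sequence, flank=10, k=2):
    """Build one feature vector per candidate motif: upstream k-mer counts
    followed by downstream k-mer counts."""
    return [
        (pos, motif, kmer_counts(up, k) + kmer_counts(down, k))
        for pos, motif, up, down in find_candidates(sequence, flank)
    ]


if __name__ == "__main__":
    demo = "ACGTACGTAC" + "AATAAA" + "G" * 10
    for pos, motif, vector in featurize(demo):
        print(pos, motif, vector)
```

Vectors of this kind could be passed to any standard classifier; in the method described above, that role is played by a support vector machine trained on spectral latent features rather than on raw counts. Candidates are returned grouped by motif, then by position.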
id pubmed-3694652
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-3694652 2013-06-27 Bioinformatics Oxford University Press 2013-07-01 2013-06-19 Text en © The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
topic ISMB/ECCB 2013 Proceedings Papers Committee, July 21 to July 23, 2013, Berlin, Germany