
Exploiting mid-range DNA patterns for sequence classification: binary abstraction Markov models

Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the utilization of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimin...


Bibliographic Details
Main Authors: Shepard, Samuel S., McSweeny, Andrew, Serpen, Gursel, Fedorov, Alexei
Format: Online Article Text
Language: English
Published: Oxford University Press 2012
Subjects: Computational Biology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3367190/
https://www.ncbi.nlm.nih.gov/pubmed/22344692
http://dx.doi.org/10.1093/nar/gks154
collection PubMed
description Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the use of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. To analyze nucleotide sequences of this length, their information content is first reduced by conversion into shorter binary patterns via the application of numerous abstraction schemes. After the conversion of genomic sequences to binary strings, homogeneous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best Markov model (MM) classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code, such as ncRNAs and 5′-untranslated regions.
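The pipeline described in the abstract (abstract the DNA to a shorter binary string, then score it with Markov models trained on exons and introns) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstraction scheme used here (one bit per non-overlapping dinucleotide, set to 1 when the window contains a G or C) is a hypothetical choice for illustration; the record does not list the actual schemes. The function names, the Laplace pseudocounts, and the uniform fallback for unseen contexts are likewise assumptions, and the scheme optimization and support vector machine combination steps are omitted.

```python
import math
from collections import defaultdict


def abstract(seq, width=2):
    """Reduce a DNA string to a shorter binary string.

    Hypothetical scheme (an assumption, not taken from the paper): each
    non-overlapping window of `width` bases becomes '1' if it contains
    a G or C, otherwise '0'. Trailing bases that do not fill a window
    are ignored.
    """
    seq = seq.upper()
    bits = []
    for i in range(0, len(seq) - width + 1, width):
        window = seq[i:i + width]
        bits.append("1" if ("G" in window or "C" in window) else "0")
    return "".join(bits)


def train_markov(binary_seqs, order):
    """Fit a homogeneous Markov model of the given order on binary strings.

    Returns {context: (log P(0 | context), log P(1 | context))}.
    Counts start at 1 (Laplace smoothing, an assumption of this sketch).
    """
    counts = defaultdict(lambda: [1, 1])
    for s in binary_seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][int(s[i])] += 1
    return {
        ctx: (math.log(c0 / (c0 + c1)), math.log(c1 / (c0 + c1)))
        for ctx, (c0, c1) in counts.items()
    }


def log_likelihood(model, s, order):
    """Sum the log transition probabilities of binary string `s`."""
    uniform = math.log(0.5)  # fallback for contexts never seen in training
    total = 0.0
    for i in range(order, len(s)):
        probs = model.get(s[i - order:i])
        total += probs[int(s[i])] if probs else uniform
    return total


def classify(seq, exon_model, intron_model, order):
    """Label a DNA string by the log-odds of the two Markov models."""
    b = abstract(seq)
    score = (log_likelihood(exon_model, b, order)
             - log_likelihood(intron_model, b, order))
    return "exon" if score > 0 else "intron"
```

Because each bit summarizes several nucleotides, a Markov model of a given order over the binary string spans a longer stretch of DNA than a model of the same order over nucleotides, which is the motivation the abstract gives for the abstraction step.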
id pubmed-3367190
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-3367190 2012-06-05
Nucleic Acids Res
Computational Biology
Oxford University Press 2012-06 2012-02-16
Text en
© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Computational Biology