Exploration of multivariate analysis in microbial coding sequence modeling

BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identificatio...

Bibliographic Details
Main authors: Mehmood, Tahir; Bohlin, Jon; Kristoffersen, Anja Bråthen; Sæbø, Solve; Warringer, Jonas; Snipen, Lars
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3473301/
https://www.ncbi.nlm.nih.gov/pubmed/22583558
http://dx.doi.org/10.1186/1471-2105-13-97
_version_ 1782246742610149376
author Mehmood, Tahir
Bohlin, Jon
Kristoffersen, Anja Bråthen
Sæbø, Solve
Warringer, Jonas
Snipen, Lars
author_sort Mehmood, Tahir
collection PubMed
description BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties. RESULTS: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001). CONCLUSIONS: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.
format Online
Article
Text
id pubmed-3473301
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3473301 2012-10-23 Exploration of multivariate analysis in microbial coding sequence modeling Mehmood, Tahir Bohlin, Jon Kristoffersen, Anja Bråthen Sæbø, Solve Warringer, Jonas Snipen, Lars BMC Bioinformatics Research Article BioMed Central 2012-05-14 /pmc/articles/PMC3473301/ /pubmed/22583558 http://dx.doi.org/10.1186/1471-2105-13-97 Text en Copyright ©2012 Mehmood et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Exploration of multivariate analysis in microbial coding sequence modeling
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3473301/
https://www.ncbi.nlm.nih.gov/pubmed/22583558
http://dx.doi.org/10.1186/1471-2105-13-97