Extracting DNA words based on the sequence features: non-uniform distribution and integrity

BACKGROUND: DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms such as the motif discovery algorithms depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm fo...

Bibliographic Details
Main authors:	Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727310/ https://www.ncbi.nlm.nih.gov/pubmed/26811154 http://dx.doi.org/10.1186/s12976-016-0028-3

_version_	1782411943639777280
author	Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo
author_facet	Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo
author_sort	Li, Zhi
collection	PubMed
description	BACKGROUND: A DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms, such as motif discovery algorithms, depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the “words” based only on the DNA sequences. METHODS: We considered non-uniform distribution and integrity to be two important features of a word, and on this basis developed an ab initio algorithm to extract “DNA words” that have potential functional meaning. A Kolmogorov-Smirnov test was used to test the consistency of DNA sequences with a uniform distribution, and integrity was judged by sequence and position alignment. Two random base sequences were adopted as negative controls, and an English book was used as a positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods. RESULTS: The results provide strong evidence that the algorithm is a promising tool for building a DNA dictionary ab initio. CONCLUSIONS: Our method provides a fast way for large-scale screening of important DNA elements and offers potential insights into the understanding of a genome. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12976-016-0028-3) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4727310
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4727310 2016-01-27 Extracting DNA words based on the sequence features: non-uniform distribution and integrity Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo Theor Biol Med Model Research BACKGROUND: A DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms, such as motif discovery algorithms, depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the “words” based only on the DNA sequences. METHODS: We considered non-uniform distribution and integrity to be two important features of a word, and on this basis developed an ab initio algorithm to extract “DNA words” that have potential functional meaning. A Kolmogorov-Smirnov test was used to test the consistency of DNA sequences with a uniform distribution, and integrity was judged by sequence and position alignment. Two random base sequences were adopted as negative controls, and an English book was used as a positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods. RESULTS: The results provide strong evidence that the algorithm is a promising tool for building a DNA dictionary ab initio. CONCLUSIONS: Our method provides a fast way for large-scale screening of important DNA elements and offers potential insights into the understanding of a genome. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12976-016-0028-3) contains supplementary material, which is available to authorized users. BioMed Central 2016-01-25 /pmc/articles/PMC4727310/ /pubmed/26811154 http://dx.doi.org/10.1186/s12976-016-0028-3 Text en © Li et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Li, Zhi Cao, Hongyan Cui, Yuehua Zhang, Yanbo Extracting DNA words based on the sequence features: non-uniform distribution and integrity
title	Extracting DNA words based on the sequence features: non-uniform distribution and integrity
title_sort	extracting dna words based on the sequence features: non-uniform distribution and integrity
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727310/ https://www.ncbi.nlm.nih.gov/pubmed/26811154 http://dx.doi.org/10.1186/s12976-016-0028-3
work_keys_str_mv	AT lizhi extractingdnawordsbasedonthesequencefeaturesnonuniformdistributionandintegrity AT caohongyan extractingdnawordsbasedonthesequencefeaturesnonuniformdistributionandintegrity AT cuiyuehua extractingdnawordsbasedonthesequencefeaturesnonuniformdistributionandintegrity AT zhangyanbo extractingdnawordsbasedonthesequencefeaturesnonuniformdistributionandintegrity
