
An improved alignment-free model for DNA sequence similarity metric

BACKGROUND: DNA clustering is an important technique for automatically finding inherent relationships among large numbers of DNA sequences, but clustering quality can still be improved considerably. The DNA sequence similarity metric is one of the key factors in clustering. The alignment-free metho...


Bibliographic Details
Main authors: Bao, Junpeng; Yuan, Ruiyu; Bao, Zhe
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4261891/
https://www.ncbi.nlm.nih.gov/pubmed/25261973
http://dx.doi.org/10.1186/1471-2105-15-321
author Bao, Junpeng
Yuan, Ruiyu
Bao, Zhe
collection PubMed
description BACKGROUND: DNA clustering is an important technique for automatically finding inherent relationships among large numbers of DNA sequences, but clustering quality can still be improved considerably. The DNA sequence similarity metric is one of the key factors in clustering. The alignment-free methodology is a very popular way to calculate DNA sequence similarity. It normally converts a sequence into a feature space based on the probability distribution of words rather than matching strings directly. Existing alignment-free models, e.g. k-tuple, employ only word frequency information and ignore many other types of useful information contained in the DNA sequence, such as the classification of nucleotide bases and their positions. It is believed that better data mining results can be achieved with compounded information. Therefore, we present a new alignment-free model that employs compounded information to improve DNA clustering quality. RESULTS: This paper proposes a Category-Position-Frequency (CPF) model, which utilizes the word frequency, position and classification information of nucleotide bases in DNA sequences. The CPF model converts a DNA sequence into three sequences according to the categories of nucleotide bases, and then yields a 12-dimensional feature vector. The feature values are computed by an entropy-based model that takes both local word frequency and position information into account. We conduct DNA clustering experiments on several datasets and compare CPF with several mainstream alignment-free models, including k-tuple, DMk, TSM, AMI and CV. The experiments show that the CPF model is superior to the other models in terms of clustering results and optimal settings. CONCLUSIONS: The following conclusions can be drawn from the experiments. (1) The hybrid information model is better than the model based on word frequency only. (2) For DNA sequences of no more than 5000 characters, the preferred sliding window size for CPF is two, which provides a great advantage for system performance. (3) The CPF model achieves efficient, stable performance and broad generalization.
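The word-frequency idea in the description can be illustrated with a minimal sketch. It is not the authors' CPF implementation: the record does not give the entropy formula or the position weighting, so both are omitted here. The three category maps (purine/pyrimidine, amino/keto, weak/strong hydrogen bond) are an assumption based on the standard chemical groupings of nucleotides, and the function names `ktuple_frequencies` and `category_word_vector` are hypothetical. The sketch only shows how three two-letter category sequences with a window of size two can give 3 x 4 = 12 word frequencies.

```python
from collections import Counter
from itertools import product

# ASSUMPTION: three standard two-class groupings of nucleotide bases.
# The record does not list the categories the CPF model actually uses.
CATEGORY_MAPS = {
    "purine/pyrimidine": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "amino/keto": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "weak/strong": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def ktuple_frequencies(seq, k, alphabet):
    """Relative frequency of every k-letter word over `alphabet`.

    Words are enumerated in sorted order; overlapping windows are counted.
    """
    words = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def category_word_vector(seq, k=2):
    """Concatenate k-word frequencies of the three category sequences.

    Plain frequencies only: the position and entropy terms of CPF
    are NOT modelled in this sketch.
    """
    seq = seq.upper()
    vector = []
    for mapping in CATEGORY_MAPS.values():
        # Rewrite the DNA string in the two-letter category alphabet,
        # skipping any symbol outside A, C, G, T.
        reduced = "".join(mapping[base] for base in seq if base in mapping)
        vector.extend(ktuple_frequencies(reduced, k, set(mapping.values())))
    return vector
```

With the default window of two, `category_word_vector("ACGTACGTAC")` returns 12 values, four per category sequence. Calling `ktuple_frequencies(seq, k, "ACGT")` directly gives the plain k-tuple vector that the description names as a baseline.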
format Online
Article
Text
id pubmed-4261891
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4261891 2014-12-10 An improved alignment-free model for DNA sequence similarity metric Bao, Junpeng Yuan, Ruiyu Bao, Zhe BMC Bioinformatics Methodology Article BioMed Central 2014-09-28 /pmc/articles/PMC4261891/ /pubmed/25261973 http://dx.doi.org/10.1186/1471-2105-15-321 Text en © Bao et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title An improved alignment-free model for DNA sequence similarity metric
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4261891/
https://www.ncbi.nlm.nih.gov/pubmed/25261973
http://dx.doi.org/10.1186/1471-2105-15-321