Absent words and the (dis)similarity analysis of DNA sequences: an experimental study


Bibliographic Details
Main authors:	Rahman, Mohammad Saifur, Alatabbi, Ali, Athar, Tanver, Crochemore, Maxime, Rahman, M. Sohel
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Short Report
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804535/ https://www.ncbi.nlm.nih.gov/pubmed/27004958 http://dx.doi.org/10.1186/s13104-016-1972-z

collection	PubMed
description	BACKGROUND: An absent word with respect to a sequence is a word that does not occur in the sequence as a factor; an absent word is minimal if all of its proper factors do occur in that sequence. In this paper we explore the idea of using minimal absent words (MAW) to compute the distance between two biological sequences. The motivation and rationale of our work come from the potential advantage of extracting as little information as possible from large genomic sequences in order to compare them in an alignment-free manner.
FINDINGS: We report an experimental study on the use of absent words as a distance measure among biological sequences, and we recommend the best indexes based on our analysis. In particular, our analysis reveals that the best performers are: the length-weighted index of relative absent word sets, the length-weighted index of the symmetric difference of the MAW sets, and the Jaccard distance between the MAW sets. We also found that the reverse complements of the sequences should be considered when computing the absent words.
CONCLUSION: The use of MAW to compute the distance between two biological sequences has a potential advantage over alignment-based methods. It is expected that this potential advantage will encourage researchers and practitioners to use MAW as a (dis)similarity measure in the context of sequence comparison and phylogeny reconstruction. We therefore present a comparison among different possible models and indexes, so that biologists and researchers can choose an appropriate model for such comparisons.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13104-016-1972-z) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4804535
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4804535 2016-03-23 BMC Res Notes Short Report BioMed Central 2016-03-22 /pmc/articles/PMC4804535/ /pubmed/27004958 http://dx.doi.org/10.1186/s13104-016-1972-z Text en © Rahman et al. 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Absent words and the (dis)similarity analysis of DNA sequences: an experimental study
topic	Short Report
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804535/ https://www.ncbi.nlm.nih.gov/pubmed/27004958 http://dx.doi.org/10.1186/s13104-016-1972-z
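
The definitions in the abstract (a minimal absent word is an absent word whose proper factors all occur in the sequence, and the Jaccard distance is taken between MAW sets) can be illustrated with a small sketch. This is a brute-force illustration for short strings only, not the authors' implementation: the function names, the default `ACGT` alphabet, and the convention that two empty sets have distance 0 are assumptions made here. It does not handle reverse complements or the length-weighted indexes, whose exact formulas are not given in this record.

```python
def factors(seq: str) -> set[str]:
    """Return every factor (substring) of seq, including the empty word."""
    n = len(seq)
    return {seq[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def minimal_absent_words(seq: str, alphabet: str = "ACGT") -> set[str]:
    """Brute-force MAW set of seq over the given alphabet (assumed: ACGT).

    A word w is a minimal absent word when w does not occur in seq but
    both w without its last letter and w without its first letter do.
    Every other proper factor of w lies inside one of those two words.
    """
    present = factors(seq)
    maws = set()
    for prefix in present:            # prefix occurs in seq by construction
        for letter in alphabet:
            word = prefix + letter
            # word[1:] is the longest proper suffix; "" is always present,
            # so single letters missing from seq are reported as MAWs.
            if word not in present and word[1:] in present:
                maws.add(word)
    return maws


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance 1 - |A & B| / |A | B| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Usage on two toy sequences.
maw_x = minimal_absent_words("ACA")
maw_y = minimal_absent_words("AC")
print(sorted(maw_x), sorted(maw_y), jaccard_distance(maw_x, maw_y))
```

Enumerating all factors takes quadratic space in the sequence length, so this only suits toy inputs; genome-scale work would need a linear-space index such as a suffix array or suffix automaton.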