Cargando…

Alignment-free phylogeny of whole genomes using underlying subwords

BACKGROUND: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not...

Descripción completa

Detalles Bibliográficos
Autores principales:	Comin, Matteo, Verzotto, Davide
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2012
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3549825/ https://www.ncbi.nlm.nih.gov/pubmed/23216990 http://dx.doi.org/10.1186/1748-7188-7-34

_version_	1782256478768332800
author	Comin, Matteo Verzotto, Davide
author_facet	Comin, Matteo Verzotto, Davide
author_sort	Comin, Matteo
collection	PubMed
description	BACKGROUND: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. METHODS: In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding to count shared subwords a multiple number of times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. RESULTS: The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. AVAILABILITY: http://www.dei.unipd.it/∼ciompin/main/underlying.html
format	Online Article Text
id	pubmed-3549825
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-35498252013-01-23 Alignment-free phylogeny of whole genomes using underlying subwords Comin, Matteo Verzotto, Davide Algorithms Mol Biol Research BACKGROUND: With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. METHODS: In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding to count shared subwords a multiple number of times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. RESULTS: The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. AVAILABILITY: http://www.dei.unipd.it/∼ciompin/main/underlying.html BioMed Central 2012-12-06 /pmc/articles/PMC3549825/ /pubmed/23216990 http://dx.doi.org/10.1186/1748-7188-7-34 Text en Copyright ©2012 Comin and Verzotto; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Comin, Matteo Verzotto, Davide Alignment-free phylogeny of whole genomes using underlying subwords
title	Alignment-free phylogeny of whole genomes using underlying subwords
title_full	Alignment-free phylogeny of whole genomes using underlying subwords
title_fullStr	Alignment-free phylogeny of whole genomes using underlying subwords
title_full_unstemmed	Alignment-free phylogeny of whole genomes using underlying subwords
title_short	Alignment-free phylogeny of whole genomes using underlying subwords
title_sort	alignment-free phylogeny of whole genomes using underlying subwords
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3549825/ https://www.ncbi.nlm.nih.gov/pubmed/23216990 http://dx.doi.org/10.1186/1748-7188-7-34
work_keys_str_mv	AT cominmatteo alignmentfreephylogenyofwholegenomesusingunderlyingsubwords AT verzottodavide alignmentfreephylogenyofwholegenomesusingunderlyingsubwords

Alignment-free phylogeny of whole genomes using underlying subwords

Ejemplares similares