Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment

BACKGROUND: Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various...

Bibliographic Details
Main Authors:	Nagar, Anurag, Hahsler, Michael
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846703/ https://www.ncbi.nlm.nih.gov/pubmed/24564200 http://dx.doi.org/10.1186/1471-2105-14-S11-S2

author	Nagar, Anurag Hahsler, Michael
collection	PubMed
description	BACKGROUND: Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so-called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. RESULTS: In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. CONCLUSION: Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.
format	Online Article Text
id	pubmed-3846703
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3846703 2013-12-06 Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment Nagar, Anurag Hahsler, Michael BMC Bioinformatics Research BioMed Central 2013-09-13 /pmc/articles/PMC3846703/ /pubmed/24564200 http://dx.doi.org/10.1186/1471-2105-14-S11-S2 Text en Copyright © 2013 Nagar and Hahsler; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3846703/ https://www.ncbi.nlm.nih.gov/pubmed/24564200 http://dx.doi.org/10.1186/1471-2105-14-S11-S2
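
The abstract in this record describes approximating edit distance between sequence segments with p-mer frequency counts and then grouping similar segments by stream clustering. The Python sketch below illustrates that general idea only. It is not the authors' implementation: the function names, the segment size, the value of p, the use of Manhattan distance between count vectors, and the simple threshold-based ("leader") clustering are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def split_segments(sequence, size=100):
    """Cut a sequence into consecutive, non-overlapping segments (size is an assumed parameter)."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def pmer_counts(segment, p=3):
    """Count overlapping p-mers in a segment and return them in a fixed order over A, C, G, T."""
    counts = Counter(segment[i:i + p] for i in range(len(segment) - p + 1))
    return [counts["".join(pmer)] for pmer in product("ACGT", repeat=p)]


def manhattan(u, v):
    """Manhattan distance between two count vectors (assumed stand-in for the edit-distance proxy)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def assign_clusters(vectors, threshold):
    """Single-pass threshold clustering: join the first center within threshold, else open a new one.

    Hypothetical stand-in for the stream clustering mentioned in the abstract.
    """
    centers, labels = [], []
    for v in vectors:
        for idx, center in enumerate(centers):
            if manhattan(v, center) <= threshold:
                labels.append(idx)
                break
        else:
            centers.append(v)
            labels.append(len(centers) - 1)
    return labels


if __name__ == "__main__":
    segments = ["AAAA", "AAAT", "CCCC"]
    vectors = [pmer_counts(s, p=2) for s in segments]
    print(assign_clusters(vectors, threshold=2))
```

Each segment is processed once and compared only against existing cluster centers, which is the kind of single-pass design that keeps run time proportional to the number of segments when the number of centers stays small.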
