Cargando…

A new algorithm for “the LCS problem” with application in compressing genome resequencing data

BACKGROUND: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. METHODS: First, we present a new algorithm for the LCS problem. Using the generali...

Descripción completa

Detalles Bibliográficos
Autores principales:	Beal, Richard, Afrin, Tazin, Farheen, Aliya, Adjeroh, Donald
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2016
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001248/ https://www.ncbi.nlm.nih.gov/pubmed/27556803 http://dx.doi.org/10.1186/s12864-016-2793-0

_version_	1782450439838498816
author	Beal, Richard Afrin, Tazin Farheen, Aliya Adjeroh, Donald
author_facet	Beal, Richard Afrin, Tazin Farheen, Aliya Adjeroh, Donald
author_sort	Beal, Richard
collection	PubMed
description	BACKGROUND: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. METHODS: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. RESULTS: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). CONCLUSION: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
format	Online Article Text
id	pubmed-5001248
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-50012482016-09-06 A new algorithm for “the LCS problem” with application in compressing genome resequencing data Beal, Richard Afrin, Tazin Farheen, Aliya Adjeroh, Donald BMC Genomics Research BACKGROUND: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. METHODS: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. RESULTS: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). CONCLUSION: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method. BioMed Central 2016-08-18 /pmc/articles/PMC5001248/ /pubmed/27556803 http://dx.doi.org/10.1186/s12864-016-2793-0 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Beal, Richard Afrin, Tazin Farheen, Aliya Adjeroh, Donald A new algorithm for “the LCS problem” with application in compressing genome resequencing data
title	A new algorithm for “the LCS problem” with application in compressing genome resequencing data
title_full	A new algorithm for “the LCS problem” with application in compressing genome resequencing data
title_fullStr	A new algorithm for “the LCS problem” with application in compressing genome resequencing data
title_full_unstemmed	A new algorithm for “the LCS problem” with application in compressing genome resequencing data
title_short	A new algorithm for “the LCS problem” with application in compressing genome resequencing data
title_sort	new algorithm for “the lcs problem” with application in compressing genome resequencing data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001248/ https://www.ncbi.nlm.nih.gov/pubmed/27556803 http://dx.doi.org/10.1186/s12864-016-2793-0
work_keys_str_mv	AT bealrichard anewalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT afrintazin anewalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT farheenaliya anewalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT adjerohdonald anewalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT bealrichard newalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT afrintazin newalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT farheenaliya newalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata AT adjerohdonald newalgorithmforthelcsproblemwithapplicationincompressinggenomeresequencingdata

A new algorithm for “the LCS problem” with application in compressing genome resequencing data

Ejemplares similares