S-conLSH: alignment-free gapped mapping of noisy long reads

BACKGROUND: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMR...

Bibliographic Details
Main Authors: Chakraborty, Angana, Morgenstern, Burkhard, Bandyopadhyay, Sanghamitra
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7879691/
https://www.ncbi.nlm.nih.gov/pubmed/33573603
http://dx.doi.org/10.1186/s12859-020-03918-3
_version_ 1783650563966107648
author Chakraborty, Angana
Morgenstern, Burkhard
Bandyopadhyay, Sanghamitra
author_facet Chakraborty, Angana
Morgenstern, Burkhard
Bandyopadhyay, Sanghamitra
author_sort Chakraborty, Angana
collection PubMed
description BACKGROUND: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. RESULTS: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. CONCLUSIONS: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.
format Online
Article
Text
id pubmed-7879691
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-78796912021-02-17 S-conLSH: alignment-free gapped mapping of noisy long reads Chakraborty, Angana Morgenstern, Burkhard Bandyopadhyay, Sanghamitra BMC Bioinformatics Methodology Article BACKGROUND: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. RESULTS: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. CONCLUSIONS: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis. 
BioMed Central 2021-02-11 /pmc/articles/PMC7879691/ /pubmed/33573603 http://dx.doi.org/10.1186/s12859-020-03918-3 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Chakraborty, Angana
Morgenstern, Burkhard
Bandyopadhyay, Sanghamitra
S-conLSH: alignment-free gapped mapping of noisy long reads
title S-conLSH: alignment-free gapped mapping of noisy long reads
title_full S-conLSH: alignment-free gapped mapping of noisy long reads
title_fullStr S-conLSH: alignment-free gapped mapping of noisy long reads
title_full_unstemmed S-conLSH: alignment-free gapped mapping of noisy long reads
title_short S-conLSH: alignment-free gapped mapping of noisy long reads
title_sort s-conlsh: alignment-free gapped mapping of noisy long reads
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7879691/
https://www.ncbi.nlm.nih.gov/pubmed/33573603
http://dx.doi.org/10.1186/s12859-020-03918-3
work_keys_str_mv AT chakrabortyangana sconlshalignmentfreegappedmappingofnoisylongreads
AT morgensternburkhard sconlshalignmentfreegappedmappingofnoisylongreads
AT bandyopadhyaysanghamitra sconlshalignmentfreegappedmappingofnoisylongreads