S-conLSH: alignment-free gapped mapping of noisy long reads
BACKGROUND: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMR...
Main Authors: | Chakraborty, Angana; Morgenstern, Burkhard; Bandyopadhyay, Sanghamitra |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7879691/ https://www.ncbi.nlm.nih.gov/pubmed/33573603 http://dx.doi.org/10.1186/s12859-020-03918-3 |
_version_ | 1783650563966107648 |
---|---|
author | Chakraborty, Angana; Morgenstern, Burkhard; Bandyopadhyay, Sanghamitra |
author_sort | Chakraborty, Angana |
collection | PubMed |
description | BACKGROUND: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. RESULTS: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. CONCLUSIONS: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis. |
format | Online Article Text |
id | pubmed-7879691 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-78796912021-02-17 S-conLSH: alignment-free gapped mapping of noisy long reads Chakraborty, Angana Morgenstern, Burkhard Bandyopadhyay, Sanghamitra BMC Bioinformatics Methodology Article BACKGROUND: The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. RESULTS: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. CONCLUSIONS: S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis. 
BioMed Central 2021-02-11 /pmc/articles/PMC7879691/ /pubmed/33573603 http://dx.doi.org/10.1186/s12859-020-03918-3 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | S-conLSH: alignment-free gapped mapping of noisy long reads |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7879691/ https://www.ncbi.nlm.nih.gov/pubmed/33573603 http://dx.doi.org/10.1186/s12859-020-03918-3 |
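
The abstract describes hashing reads with multiple spaced patterns so that a noisy read can still be placed on the reference without base-to-base alignment. The following is a minimal Python sketch of that general idea only; it is not the S-conLSH implementation. The pattern strings (`1` = position used, `0` = position ignored), the function names `spaced_words`, `build_index`, and `map_read`, and the offset-voting step are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def spaced_words(seq, pattern):
    """Yield (position, key) for every window of seq.

    The key keeps only the bases under a '1' in the pattern, so
    mismatches that fall under a '0' do not change the key.
    """
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        yield pos, "".join(seq[pos + i] for i in keep)


def build_index(reference, patterns):
    """Map (pattern id, spaced key) to the reference positions where it occurs."""
    index = defaultdict(list)
    for p_id, pattern in enumerate(patterns):
        for pos, key in spaced_words(reference, pattern):
            index[(p_id, key)].append(pos)
    return index


def map_read(read, index, patterns):
    """Return the reference offset with the most matching spaced keys, or None.

    Each shared key votes for (reference position - read position).
    This voting scheme is an illustrative assumption, not the paper's method.
    """
    votes = Counter()
    for p_id, pattern in enumerate(patterns):
        for pos, key in spaced_words(read, pattern):
            for ref_pos in index.get((p_id, key), ()):
                votes[ref_pos - pos] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

A caller would build the index once with `build_index(reference, ["1101", "1011"])` and then call `map_read` for each read. Using more than one pattern lets a window that is disrupted under one pattern still match under another.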