
SNPs detection by eBWT positional clustering

BACKGROUND: Sequencing technologies keep on turning cheaper and faster, thus putting a growing pressure for data structures designed to efficiently store raw data, and possibly perform analysis therein. In this view, there is a growing interest in alignment-free and reference-free variants calling m...


Bibliographic Details
Main Authors: Prezza, Nicola; Pisanti, Nadia; Sciortino, Marinella; Rosone, Giovanna
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6364478/
https://www.ncbi.nlm.nih.gov/pubmed/30839919
http://dx.doi.org/10.1186/s13015-019-0137-8
author Prezza, Nicola
Pisanti, Nadia
Sciortino, Marinella
Rosone, Giovanna
collection PubMed
description BACKGROUND: Sequencing technologies keep getting cheaper and faster, creating growing demand for data structures designed to efficiently store raw data and, possibly, to perform analysis on it. In this view, there is growing interest in alignment-free and reference-free variant calling methods that only make use of (suitably indexed) raw read data. RESULTS: We develop the positional clustering theory that (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) exhibits an elegant and precise LCP-array-based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNP calling method, and we devised a corresponding SNP calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays because, in accordance with our theoretical framework, they lie within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP. CONCLUSIONS: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used for the problem of identifying SNPs, and it appears to be a promising approach for calling other types of variants directly on raw sequencing data. AVAILABILITY: The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp.
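The clustering idea summarized in the description can be illustrated with a toy sketch. This is not the ebwt2snp implementation: the naive suffix sorting, the rule "a cluster is a maximal run of adjacent suffixes whose LCP is at least k", and the names `ebwt_and_lcp`, `clusters`, `candidate_snps`, `k`, and `min_count` are all assumptions made here for illustration only.

```python
from collections import Counter


def ebwt_and_lcp(reads):
    """Naively build an eBWT and LCP array for a small read collection.

    Assumption: every read gets its own end-marker "$"; markers sort before
    every base, and ties between markers are broken by read index.
    """
    suffixes = []
    for r_idx, read in enumerate(reads):
        for i in range(len(read) + 1):
            prev = read[i - 1] if i > 0 else "$"  # base preceding the suffix
            suffixes.append((read[i:] + "$", r_idx, prev))
    suffixes.sort(key=lambda s: (s[0], s[1]))

    def common(a, b):
        # Common prefix length; end-markers are distinct, so never match them.
        n = 0
        for x, y in zip(a, b):
            if x != y or x == "$":
                break
            n += 1
        return n

    bwt = [s[2] for s in suffixes]
    lcp = [0] + [common(suffixes[j - 1][0], suffixes[j][0])
                 for j in range(1, len(suffixes))]
    return bwt, lcp


def clusters(lcp, k):
    """Yield half-open ranges (start, end) of maximal runs with LCP >= k."""
    start = 0
    for j in range(1, len(lcp) + 1):
        if j == len(lcp) or lcp[j] < k:
            if j - start > 1:
                yield (start, j)
            start = j


def candidate_snps(reads, k, min_count=2):
    """Report clusters whose eBWT symbols contain >= 2 well-supported bases."""
    bwt, lcp = ebwt_and_lcp(reads)
    found = []
    for s, e in clusters(lcp, k):
        counts = Counter(c for c in bwt[s:e] if c != "$")
        frequent = sorted(b for b, n in counts.items() if n >= min_count)
        if len(frequent) >= 2:
            found.append((s, e, frequent))
    return found


if __name__ == "__main__":
    # Reads drawn from two toy haplotypes that differ in one base (A vs G).
    reads = ["TTACAG", "TACAGG", "TTGCAG", "TGCAGG"]
    for start, end, bases in candidate_snps(reads, k=3):
        print(start, end, bases)
```

In this toy input, the four suffixes that begin with the shared right context `CAG` end up adjacent in sorted order, so the bases preceding them (the two alleles) fall into one cluster. The per-base counts in that cluster play the role of the coverage mentioned in the description.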
format Online
Article
Text
id pubmed-6364478
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6364478 2019-02-15 SNPs detection by eBWT positional clustering Prezza, Nicola Pisanti, Nadia Sciortino, Marinella Rosone, Giovanna Algorithms Mol Biol Research
BioMed Central 2019-02-06 /pmc/articles/PMC6364478/ /pubmed/30839919 http://dx.doi.org/10.1186/s13015-019-0137-8 Text en © The Author(s) 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title SNPs detection by eBWT positional clustering
topic Research