Dot2dot: accurate whole-genome tandem repeats discovery

MOTIVATION: Large-scale sequencing projects have confirmed the hypothesis that eukaryotic DNA is rich in repetitions whose functional role needs to be elucidated. In particular, tandem repeats (TRs) (i.e. short, almost identical sequences that lie adjacent to each other) have been associated to many...

Bibliographic Details
Main Authors: Genovese, Loredana M; Mosca, Marco M; Pellegrini, Marco; Geraci, Filippo
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects: Original Papers
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419916/
https://www.ncbi.nlm.nih.gov/pubmed/30165507
http://dx.doi.org/10.1093/bioinformatics/bty747
author Genovese, Loredana M
Mosca, Marco M
Pellegrini, Marco
Geraci, Filippo
collection PubMed
description MOTIVATION: Large-scale sequencing projects have confirmed the hypothesis that eukaryotic DNA is rich in repetitions whose functional role needs to be elucidated. In particular, tandem repeats (TRs) (i.e. short, almost identical sequences that lie adjacent to each other) have been associated to many cellular processes and, indeed, are also involved in several genetic disorders. The need of comprehensive lists of TRs for association studies and the absence of a computational model able to capture their variability have revived research on discovery algorithms. RESULTS: Building upon the idea that sequence similarities can be easily displayed using graphical methods, we formalized the structure that TRs induce in dot-plot matrices where a sequence is compared with itself. Leveraging on the observation that a compact representation of these matrices can be built and searched in linear time, we developed Dot2dot: an accurate algorithm fast enough to be suitable for whole-genome discovery of TRs. Experiments on five manually curated collections of TRs have shown that Dot2dot is more accurate than other established methods, and completes the analysis of the biggest known reference genome in about one day on a standard PC. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are freely available upon paper acceptance at the URL: https://github.com/Gege7177/Dot2dot. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6419916
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6419916 2019-03-20 Dot2dot: accurate whole-genome tandem repeats discovery Genovese, Loredana M Mosca, Marco M Pellegrini, Marco Geraci, Filippo Bioinformatics Original Papers Oxford University Press 2019-03-15 2018-08-28 /pmc/articles/PMC6419916/ /pubmed/30165507 http://dx.doi.org/10.1093/bioinformatics/bty747 Text en © The Author(s) 2018. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Dot2dot: accurate whole-genome tandem repeats discovery
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6419916/
https://www.ncbi.nlm.nih.gov/pubmed/30165507
http://dx.doi.org/10.1093/bioinformatics/bty747