Cargando…

A fast adaptive algorithm for computing whole-genome homology maps

MOTIVATION: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical m...

Descripción completa

Detalles Bibliográficos
Autores principales:	Jain, Chirag, Koren, Sergey, Dilthey, Alexander, Phillippy, Adam M, Aluru, Srinivas
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2018
Materias:	Eccb 2018: European Conference on Computational Biology Proceedings
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129286/ https://www.ncbi.nlm.nih.gov/pubmed/30423094 http://dx.doi.org/10.1093/bioinformatics/bty597

_version_	1783353774074494976
author	Jain, Chirag Koren, Sergey Dilthey, Alexander Phillippy, Adam M Aluru, Srinivas
author_facet	Jain, Chirag Koren, Sergey Dilthey, Alexander Phillippy, Adam M Aluru, Srinivas
author_sort	Jain, Chirag
collection	PubMed
description	MOTIVATION: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements. RESULTS: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory-usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be [Formula: see text] on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and [Formula: see text] identity. The reported output achieves good recall and covers twice the number of bases than the current UCSC browser’s segmental duplication annotation. AVAILABILITY AND IMPLEMENTATION: https://github.com/marbl/MashMap
format	Online Article Text
id	pubmed-6129286
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-61292862018-09-12 A fast adaptive algorithm for computing whole-genome homology maps Jain, Chirag Koren, Sergey Dilthey, Alexander Phillippy, Adam M Aluru, Srinivas Bioinformatics Eccb 2018: European Conference on Computational Biology Proceedings MOTIVATION: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements. RESULTS: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory-usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be [Formula: see text] on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and [Formula: see text] identity. The reported output achieves good recall and covers twice the number of bases than the current UCSC browser’s segmental duplication annotation. AVAILABILITY AND IMPLEMENTATION: https://github.com/marbl/MashMap Oxford University Press 2018-09-01 2018-09-08 /pmc/articles/PMC6129286/ /pubmed/30423094 http://dx.doi.org/10.1093/bioinformatics/bty597 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Eccb 2018: European Conference on Computational Biology Proceedings Jain, Chirag Koren, Sergey Dilthey, Alexander Phillippy, Adam M Aluru, Srinivas A fast adaptive algorithm for computing whole-genome homology maps
title	A fast adaptive algorithm for computing whole-genome homology maps
title_full	A fast adaptive algorithm for computing whole-genome homology maps
title_fullStr	A fast adaptive algorithm for computing whole-genome homology maps
title_full_unstemmed	A fast adaptive algorithm for computing whole-genome homology maps
title_short	A fast adaptive algorithm for computing whole-genome homology maps
title_sort	fast adaptive algorithm for computing whole-genome homology maps
topic	Eccb 2018: European Conference on Computational Biology Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129286/ https://www.ncbi.nlm.nih.gov/pubmed/30423094 http://dx.doi.org/10.1093/bioinformatics/bty597
work_keys_str_mv	AT jainchirag afastadaptivealgorithmforcomputingwholegenomehomologymaps AT korensergey afastadaptivealgorithmforcomputingwholegenomehomologymaps AT diltheyalexander afastadaptivealgorithmforcomputingwholegenomehomologymaps AT phillippyadamm afastadaptivealgorithmforcomputingwholegenomehomologymaps AT alurusrinivas afastadaptivealgorithmforcomputingwholegenomehomologymaps AT jainchirag fastadaptivealgorithmforcomputingwholegenomehomologymaps AT korensergey fastadaptivealgorithmforcomputingwholegenomehomologymaps AT diltheyalexander fastadaptivealgorithmforcomputingwholegenomehomologymaps AT phillippyadamm fastadaptivealgorithmforcomputingwholegenomehomologymaps AT alurusrinivas fastadaptivealgorithmforcomputingwholegenomehomologymaps

A fast adaptive algorithm for computing whole-genome homology maps

Ejemplares similares