The string decomposition problem and its applications to centromere analysis and assembly

MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) forme...

Bibliographic Details
Main authors: Dvorkina, Tatiana; Bzikadze, Andrey V; Pevzner, Pavel A
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects: Comparative and Functional Genomics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7428072/
https://www.ncbi.nlm.nih.gov/pubmed/32657390
http://dx.doi.org/10.1093/bioinformatics/btaa454
_version_ 1783571002203045888
author Dvorkina, Tatiana
Bzikadze, Andrey V
Pevzner, Pavel A
collection PubMed
description MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads. RESULTS: We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identifying all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. SD opens the possibility of generating a complete set of human monomers and HORs for use in the ongoing efforts to generate the complete assembly of the human genome. AVAILABILITY AND IMPLEMENTATION: StringDecomposer is publicly available at https://github.com/ablab/stringdecomposer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-7428072
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7428072 2020-08-19 The string decomposition problem and its applications to centromere analysis and assembly Dvorkina, Tatiana Bzikadze, Andrey V Pevzner, Pavel A Bioinformatics Comparative and Functional Genomics Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7428072/ /pubmed/32657390 http://dx.doi.org/10.1093/bioinformatics/btaa454 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title The string decomposition problem and its applications to centromere analysis and assembly
topic Comparative and Functional Genomics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7428072/
https://www.ncbi.nlm.nih.gov/pubmed/32657390
http://dx.doi.org/10.1093/bioinformatics/btaa454