Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity

Bibliographic Details
Main Authors: Chrisman, Brianna; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Paskov, Kelley; Washington, Peter; Petereit, Juli; Wall, Dennis P.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2023
Subjects: Methods
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691534/
https://www.ncbi.nlm.nih.gov/pubmed/37879860
http://dx.doi.org/10.1101/gr.277175.122
_version_ 1785152753897570304
author Chrisman, Brianna
He, Chloe
Jung, Jae-Yoon
Stockham, Nate
Paskov, Kelley
Washington, Peter
Petereit, Juli
Wall, Dennis P.
author_sort Chrisman, Brianna
collection PubMed
description Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.
format Online
Article
Text
id pubmed-10691534
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-10691534 2023-12-02 Genome Res Methods
Cold Spring Harbor Laboratory Press 2023-10 /pmc/articles/PMC10691534/ /pubmed/37879860 http://dx.doi.org/10.1101/gr.277175.122 Text en © 2023 Chrisman et al.; Published by Cold Spring Harbor Laboratory Press. This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at https://creativecommons.org/licenses/by-nc/4.0/.
title Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691534/
https://www.ncbi.nlm.nih.gov/pubmed/37879860
http://dx.doi.org/10.1101/gr.277175.122