Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity
Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 a...
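Main authors: | Chrisman, Brianna; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Paskov, Kelley; Washington, Peter; Petereit, Juli; Wall, Dennis P. |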
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2023 |
Subjects: | Methods |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691534/ https://www.ncbi.nlm.nih.gov/pubmed/37879860 http://dx.doi.org/10.1101/gr.277175.122 |
_version_ | 1785152753897570304 |
---|---|
author | Chrisman, Brianna; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Paskov, Kelley; Washington, Peter; Petereit, Juli; Wall, Dennis P. |
author_facet | Chrisman, Brianna; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Paskov, Kelley; Washington, Peter; Petereit, Juli; Wall, Dennis P. |
author_sort | Chrisman, Brianna |
collection | PubMed |
description | Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity. |
format | Online Article Text |
id | pubmed-10691534 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-106915342023-12-02 Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity Chrisman, Brianna He, Chloe Jung, Jae-Yoon Stockham, Nate Paskov, Kelley Washington, Peter Petereit, Juli Wall, Dennis P. Genome Res Methods Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity. 
Cold Spring Harbor Laboratory Press 2023-10 /pmc/articles/PMC10691534/ /pubmed/37879860 http://dx.doi.org/10.1101/gr.277175.122 Text en © 2023 Chrisman et al.; Published by Cold Spring Harbor Laboratory Press https://creativecommons.org/licenses/by-nc/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/ (https://creativecommons.org/licenses/by-nc/4.0/). |
spellingShingle | Methods Chrisman, Brianna He, Chloe Jung, Jae-Yoon Stockham, Nate Paskov, Kelley Washington, Peter Petereit, Juli Wall, Dennis P. Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity |
title | Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity |
title_full | Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity |
title_fullStr | Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity |
title_full_unstemmed | Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity |
title_short | Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity |
title_sort | localizing unmapped sequences with families to validate the telomere-to-telomere assembly and identify new hotspots for genetic diversity |
topic | Methods |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691534/ https://www.ncbi.nlm.nih.gov/pubmed/37879860 http://dx.doi.org/10.1101/gr.277175.122 |
work_keys_str_mv | AT chrismanbrianna localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT hechloe localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT jungjaeyoon localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT stockhamnate localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT paskovkelley localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT washingtonpeter localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT petereitjuli localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity AT walldennisp localizingunmappedsequenceswithfamiliestovalidatethetelomeretotelomereassemblyandidentifynewhotspotsforgeneticdiversity |
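
The description above says that ASLAN uses a maximum likelihood model to pick the genomic region whose family phasings best explain where a subsequence appears among unmapped reads. The record does not give the model itself, so the following is only a toy sketch of that general idea and not the published ASLAN method. Everything in it is an assumption made for illustration: the function names (`region_log_likelihood`, `localize`), the data layout, the error rate `eps`, and the simplification that a k-mer sits on exactly one parental haplotype per family.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical layout (not from the record):
#   phasing[family][region] -> one (maternal_hap, paternal_hap) pair per child, each 0 or 1
#   presence[family]        -> one bool per child: was the k-mer seen in that child's unmapped reads?
Phasing = Dict[str, Dict[str, List[Tuple[int, int]]]]
Presence = Dict[str, List[bool]]


def region_log_likelihood(child_haps: List[Tuple[int, int]],
                          child_has_kmer: List[bool],
                          eps: float = 0.05) -> float:
    """Best log-likelihood over the four parental haplotypes that might carry the k-mer."""
    best = -math.inf
    for parent in (0, 1):          # 0 = maternal, 1 = paternal
        for hap in (0, 1):         # which of that parent's two haplotypes
            ll = 0.0
            for haps, has in zip(child_haps, child_has_kmer):
                expected = haps[parent] == hap  # child inherited the carrier haplotype
                ll += math.log(1 - eps if expected == has else eps)
            best = max(best, ll)
    return best


def localize(phasing: Phasing, presence: Presence, eps: float = 0.05):
    """Return the region with the highest summed log-likelihood, plus all region scores."""
    regions = next(iter(phasing.values())).keys()
    scores = {
        region: sum(
            region_log_likelihood(phasing[fam][region], presence[fam], eps)
            for fam in presence
        )
        for region in regions
    }
    return max(scores, key=scores.get), scores


# Made-up example: two families, two candidate 1-Mb regions.
phasing = {
    "famA": {
        "chr1:0-1Mb": [(0, 0), (1, 0), (0, 1)],
        "chr2:0-1Mb": [(1, 0), (1, 1), (0, 1)],
    },
    "famB": {
        "chr1:0-1Mb": [(0, 1), (1, 1)],
        "chr2:0-1Mb": [(0, 1), (0, 0)],
    },
}
presence = {"famA": [True, False, True], "famB": [False, True]}

best, scores = localize(phasing, presence)
print(best, scores)
```

In this made-up input, the presence pattern of every child is consistent with a single parental haplotype at `chr1:0-1Mb`, while `chr2:0-1Mb` needs one mismatch in `famA`, so the first region scores higher. Summing log-likelihoods across families is what lets many weakly informative families jointly narrow down a region.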