
A Method for Localizing Non-Reference Sequences to the Human Genome

As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics’ improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference ge...


Bibliographic Details
Main authors: Chrisman, Brianna Sierra; Paskov, Kelley M; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Washington, Peter Yigitcan; Wall, Dennis Paul
Format: Online Article Text
Language: English
Published: 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8730539/
https://www.ncbi.nlm.nih.gov/pubmed/34890159
_version_ 1784627153886773248
author Chrisman, Brianna Sierra
Paskov, Kelley M
He, Chloe
Jung, Jae-Yoon
Stockham, Nate
Washington, Peter Yigitcan
Wall, Dennis Paul
author_sort Chrisman, Brianna Sierra
collection PubMed
description As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics’ improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify which regions of the human genome non-reference sequences belong to. We extract reads that do not align to the reference genome, and compute the population’s distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine’s diversity crisis.
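The two steps the abstract describes (counting k-mers in unmapped reads, then choosing the region whose sibling-sharing pattern best explains where a k-mer appears) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `error` parameter, and the scoring rule are assumptions made for illustration, and the Hidden Markov Model that infers sibling sharing is replaced here by a precomputed true/false flag per region. The toy model assumes that siblings who share their genetic material at a region should agree on whether they carry the k-mer, apart from a small error rate, and that siblings who do not share it agree only by chance.

```python
from collections import Counter
import math


def count_kmers(reads, k=100):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def log_likelihood(region_sharing, kmer_presence, error=0.05):
    """Toy log-likelihood of one candidate region (hypothetical model).

    region_sharing: dict sibling pair -> bool, True if the pair shares its
        genetic material at this region (assumed to be given).
    kmer_presence: dict sibling pair -> (bool, bool), whether each sibling
        carries the k-mer.
    """
    total = 0.0
    for pair, (first, second) in kmer_presence.items():
        concordant = first == second
        if region_sharing[pair]:
            # Sharing siblings should agree, up to an assumed error rate.
            p = 1.0 - error if concordant else error
        else:
            # Assumption: non-sharing siblings agree with probability 0.5.
            p = 0.5
        total += math.log(p)
    return total


def localize(kmer_presence, sharing_by_region, error=0.05):
    """Return the maximum-likelihood region and the score of every region."""
    scores = {
        region: log_likelihood(sharing, kmer_presence, error)
        for region, sharing in sharing_by_region.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(count_kmers(["ACGTAC"], k=4))

    presence = {"pair1": (True, True), "pair2": (False, False), "pair3": (True, False)}
    sharing = {
        "regionA": {"pair1": True, "pair2": True, "pair3": False},
        "regionB": {"pair1": False, "pair2": False, "pair3": True},
    }
    best, scores = localize(presence, sharing)
    print(best, scores)
```

In this toy data the two sharing pairs at regionA agree on the k-mer, while the only sharing pair at regionB disagrees, so regionA receives the higher score. A real analysis would work with many families and with sharing states inferred along the whole genome.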
format Online
Article
Text
id pubmed-8730539
institution National Center for Biotechnology Information
language English
publishDate 2022
record_format MEDLINE/PubMed
spelling pubmed-8730539 2022-01-05 A Method for Localizing Non-Reference Sequences to the Human Genome Chrisman, Brianna Sierra; Paskov, Kelley M; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Washington, Peter Yigitcan; Wall, Dennis Paul Pac Symp Biocomput Article
2022 /pmc/articles/PMC8730539/ /pubmed/34890159 Text en https://creativecommons.org/licenses/by-nc/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
title A Method for Localizing Non-Reference Sequences to the Human Genome
title_sort method for localizing non-reference sequences to the human genome
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8730539/
https://www.ncbi.nlm.nih.gov/pubmed/34890159