A Method for Localizing Non-Reference Sequences to the Human Genome
As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics’ improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference ge...
Main authors: | Chrisman, Brianna Sierra; Paskov, Kelley M; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Washington, Peter Yigitcan; Wall, Dennis Paul |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8730539/ https://www.ncbi.nlm.nih.gov/pubmed/34890159 |
_version_ | 1784627153886773248 |
---|---|
author | Chrisman, Brianna Sierra; Paskov, Kelley M; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Washington, Peter Yigitcan; Wall, Dennis Paul |
author_facet | Chrisman, Brianna Sierra; Paskov, Kelley M; He, Chloe; Jung, Jae-Yoon; Stockham, Nate; Washington, Peter Yigitcan; Wall, Dennis Paul |
author_sort | Chrisman, Brianna Sierra |
collection | PubMed |
description | As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics’ improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify what regions of the human genome non-reference sequences belong to. We extract reads that don’t align to the reference genome, and compute the population’s distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine’s diversity crisis. |
format | Online Article Text |
id | pubmed-8730539 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
record_format | MEDLINE/PubMed |
spelling | pubmed-87305392022-01-05 A Method for Localizing Non-Reference Sequences to the Human Genome Chrisman, Brianna Sierra Paskov, Kelley M He, Chloe Jung, Jae-Yoon Stockham, Nate Washington, Peter Yigitcan Wall, Dennis Paul Pac Symp Biocomput Article As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics’ improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify what regions of the human genome non-reference sequences belong to. We extract reads that don’t align to the reference genome, and compute the population’s distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine’s diversity crisis. 
2022 /pmc/articles/PMC8730539/ /pubmed/34890159 Text en https://creativecommons.org/licenses/by-nc/4.0/Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. |
spellingShingle | Article Chrisman, Brianna Sierra Paskov, Kelley M He, Chloe Jung, Jae-Yoon Stockham, Nate Washington, Peter Yigitcan Wall, Dennis Paul A Method for Localizing Non-Reference Sequences to the Human Genome |
title | A Method for Localizing Non-Reference Sequences to the Human Genome |
title_full | A Method for Localizing Non-Reference Sequences to the Human Genome |
title_fullStr | A Method for Localizing Non-Reference Sequences to the Human Genome |
title_full_unstemmed | A Method for Localizing Non-Reference Sequences to the Human Genome |
title_short | A Method for Localizing Non-Reference Sequences to the Human Genome |
title_sort | method for localizing non-reference sequences to the human genome |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8730539/ https://www.ncbi.nlm.nih.gov/pubmed/34890159 |
work_keys_str_mv | AT chrismanbriannasierra amethodforlocalizingnonreferencesequencestothehumangenome AT paskovkelleym amethodforlocalizingnonreferencesequencestothehumangenome AT hechloe amethodforlocalizingnonreferencesequencestothehumangenome AT jungjaeyoon amethodforlocalizingnonreferencesequencestothehumangenome AT stockhamnate amethodforlocalizingnonreferencesequencestothehumangenome AT washingtonpeteryigitcan amethodforlocalizingnonreferencesequencestothehumangenome AT walldennispaul amethodforlocalizingnonreferencesequencestothehumangenome AT chrismanbriannasierra methodforlocalizingnonreferencesequencestothehumangenome AT paskovkelleym methodforlocalizingnonreferencesequencestothehumangenome AT hechloe methodforlocalizingnonreferencesequencestothehumangenome AT jungjaeyoon methodforlocalizingnonreferencesequencestothehumangenome AT stockhamnate methodforlocalizingnonreferencesequencestothehumangenome AT washingtonpeteryigitcan methodforlocalizingnonreferencesequencestothehumangenome AT walldennispaul methodforlocalizingnonreferencesequencestothehumangenome |
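
The description above outlines two steps: counting k-mers in unmapped reads, then matching each k-mer's distribution across siblings to inheritance patterns to find its most likely region. The following is a minimal Python sketch of that idea, not the authors' implementation. It replaces the paper's Hidden Markov Model with a precomputed, hypothetical table of sibling-pair sharing per region, and it uses a simplified likelihood: pairs that fully share a region are assumed to agree on k-mer presence except with probability `error`, and all other pairs agree by chance with probability 0.5. The function names, the `error` parameter, and the region labels are illustrative assumptions.

```python
from collections import Counter
import math


def count_kmers(reads, k=100):
    """Count every length-k substring across a collection of read strings."""
    counts = Counter()
    for read in reads:
        # A read shorter than k contributes nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def log_likelihood(concordant, sharing, error=0.05):
    """Log-likelihood of one k-mer's sibling concordance under one region.

    concordant[i] is True when sibling pair i agrees on carrying the k-mer.
    sharing[i] is True when pair i is assumed to share the region fully.
    The 0.5 chance-agreement value is a simplifying assumption.
    """
    total = 0.0
    for agrees, shares in zip(concordant, sharing):
        if shares:
            total += math.log(1.0 - error if agrees else error)
        else:
            total += math.log(0.5)
    return total


def most_likely_region(concordant, sharing_by_region, error=0.05):
    """Return the region label that maximizes the toy likelihood."""
    return max(
        sharing_by_region,
        key=lambda region: log_likelihood(
            concordant, sharing_by_region[region], error
        ),
    )


# Toy usage with made-up data: three sibling pairs, two candidate regions.
toy_sharing = {
    "region_A": [True, True, False],
    "region_B": [False, True, True],
}
toy_concordance = [True, True, False]
best = most_likely_region(toy_concordance, toy_sharing)
```

In this toy input, `region_B` requires a fully sharing pair to disagree on the k-mer, which the model penalizes heavily, so `region_A` is selected. A real pipeline would derive the sharing table from family genotype data and would need to handle sequencing error and partial sharing more carefully.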