Towards a reference genome that captures global genetic diversity


Bibliographic Details
Main Authors: Wong, Karen H. Y., Ma, Walfred, Wei, Chun-Yu, Yeh, Erh-Chan, Lin, Wan-Jia, Wang, Elin H. F., Su, Jen-Ping, Hsieh, Feng-Jen, Kao, Hsiao-Jung, Chen, Hsiao-Huei, Chow, Stephen K., Young, Eleanor, Chu, Catherine, Poon, Annie, Yang, Chi-Fan, Lin, Dar-Shong, Hu, Yu-Feng, Wu, Jer-Yuarn, Lee, Ni-Chung, Hwu, Wuh-Liang, Boffelli, Dario, Martin, David, Xiao, Ming, Kwok, Pui-Yan
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7599213/
https://www.ncbi.nlm.nih.gov/pubmed/33127893
http://dx.doi.org/10.1038/s41467-020-19311-w
Description
Summary: The current human reference genome is predominantly derived from a single individual and it does not adequately reflect human genetic diversity. Here, we analyze 338 high-quality human assemblies of genetically divergent human populations to identify missing sequences in the human reference genome with breakpoint resolution. We identify 127,727 recurrent non-reference unique insertions spanning 18,048,877 bp, some of which disrupt exons and known regulatory elements. To improve genome annotations, we linearly integrate these sequences into the chromosomal assemblies and construct a Human Diversity Reference. Leveraging this reference, an average of 402,573 previously unmapped reads can be recovered for a given genome sequenced to ~40X coverage. Transcriptomic diversity among these non-reference sequences can also be directly assessed. We successfully map tens of thousands of previously discarded RNA-Seq reads to this reference and identify transcription evidence in 4781 gene loci, underlining the importance of these non-reference sequences in functional genomics. Our extensive datasets are important advances toward a comprehensive reference representation of global human genetic diversity.
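
The summary says that non-reference insertions are "linearly integrated" into the chromosomal assemblies at their breakpoints. The record does not describe how the authors implemented this, so the following is only a minimal sketch of the general idea, under stated assumptions: each insertion is given as a 0-based breakpoint on the original sequence plus the inserted bases, and the inserted bases are placed immediately before the base at that breakpoint. The function names `integrate_insertions` and `lift_coordinate` are hypothetical and are not taken from the paper or any library.

```python
from typing import List, Tuple

# Hypothetical representation: (breakpoint, inserted_sequence), where the
# breakpoint is a 0-based coordinate on the ORIGINAL reference sequence.
Insertion = Tuple[int, str]


def integrate_insertions(reference: str, insertions: List[Insertion]) -> str:
    """Return a new sequence with every insertion spliced in at its breakpoint.

    Assumption: an insertion at breakpoint p is placed immediately before
    reference[p]; p == len(reference) appends to the end.
    """
    pieces = []
    prev = 0
    # Process breakpoints left to right so each reference slice is copied once.
    for pos, seq in sorted(insertions):
        if not 0 <= pos <= len(reference):
            raise ValueError(f"breakpoint {pos} lies outside the reference")
        pieces.append(reference[prev:pos])  # unchanged reference up to breakpoint
        pieces.append(seq)                  # non-reference sequence
        prev = pos
    pieces.append(reference[prev:])         # remaining reference tail
    return "".join(pieces)


def lift_coordinate(pos: int, insertions: List[Insertion]) -> int:
    """Map an original 0-based coordinate to the augmented sequence.

    Every insertion at or before `pos` shifts that base to the right by the
    length of the inserted sequence.
    """
    shift = sum(len(seq) for bp, seq in insertions if bp <= pos)
    return pos + shift


if __name__ == "__main__":
    ref = "ACGTACGT"
    ins = [(4, "TTT"), (2, "GG")]
    augmented = integrate_insertions(ref, ins)
    print(augmented)                 # ACGGGTTTTACGT
    print(lift_coordinate(4, ins))   # 9
```

Because integration changes the length of each chromosome, any annotation on the original coordinates has to be shifted in the same way; `lift_coordinate` illustrates that bookkeeping for a single position. A real pipeline would also need to handle overlapping or alternative insertions at the same breakpoint, which this sketch does not attempt.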