
Quality control of large genome datasets

The 1000 Genomes Project (TGP) is a foundational resource that serves the biomedical community as a standard reference cohort for human genetic variation. There are now seven public versions of these genomes. The TGP Consortium produced the first by mapping its final data release against human refer...


Bibliographic Details
Main authors: Robinson, Max, Joshi, Arpita, Vidyarthi, Ansh, Maccoun, Mary, Rangavajjhala, Sanjay, Glusman, Gustavo
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250042/
https://www.ncbi.nlm.nih.gov/pubmed/35789587
http://dx.doi.org/10.1016/j.xhgg.2022.100123
_version_ 1784739722684268544
author Robinson, Max
Joshi, Arpita
Vidyarthi, Ansh
Maccoun, Mary
Rangavajjhala, Sanjay
Glusman, Gustavo
collection PubMed
description The 1000 Genomes Project (TGP) is a foundational resource that serves the biomedical community as a standard reference cohort for human genetic variation. There are now seven public versions of these genomes. The TGP Consortium produced the first by mapping its final data release against human reference sequence GRCh37, then “lifted over” these genomes to the improved reference sequence (GRCh38) when it was released, and remapped the original data to GRCh38 with two similar pipelines. As best-practice quality validation, the pipelines that generated these versions were benchmarked against the Genome In A Bottle Consortium’s “platinum quality” genome (NA12878). The New York Genome Center recently released the results of independently resequencing the cohort at greater depth (30×), a phased version informed by the inclusion of related individuals, and independently remapped the original variant calls to GRCh38. We performed a cross-comparison evaluation of all seven versions using genome fingerprinting, which supports ultrafast genome comparison even across reference versions. We noted multiple issues, including discrepancies in cohort membership, disagreement on the overall level of variation, evidence of substandard pipeline performance on specific genomes and in specific regions of the genome, cryptic relationships between individuals, inconsistent phasing, and annotation distortions caused by the history of the reference genome itself. We therefore recommend global quality assessment by rapid genome comparisons, alongside benchmarking as part of best-practice quality assessment of large genome datasets. Our observations also help inform the decision of which version to use, to support analyses by individual researchers.
format Online
Article
Text
id pubmed-9250042
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9250042 2022-07-03 Quality control of large genome datasets Robinson, Max Joshi, Arpita Vidyarthi, Ansh Maccoun, Mary Rangavajjhala, Sanjay Glusman, Gustavo HGG Adv Article
Elsevier 2022-06-07 /pmc/articles/PMC9250042/ /pubmed/35789587 http://dx.doi.org/10.1016/j.xhgg.2022.100123 Text en © 2022 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Quality control of large genome datasets
topic Article