Discordant calls across genotype discovery approaches elucidate variants with systematic errors

Large-scale high-throughput sequencing data sets have been transformative for informing clinical variant interpretation and for use as reference panels for statistical and population genetic efforts. Although such resources are often treated as ground truth, we find that in widely used reference dat...

Bibliographic Details
Main authors: Atkinson, Elizabeth G., Artomov, Mykyta, Loboda, Alexander A., Rehm, Heidi L., MacArthur, Daniel G., Karczewski, Konrad J., Neale, Benjamin M., Daly, Mark J.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10519400/
https://www.ncbi.nlm.nih.gov/pubmed/37253541
http://dx.doi.org/10.1101/gr.277908.123
_version_ 1785145954150645760
author Atkinson, Elizabeth G.
Artomov, Mykyta
Loboda, Alexander A.
Rehm, Heidi L.
MacArthur, Daniel G.
Karczewski, Konrad J.
Neale, Benjamin M.
Daly, Mark J.
author_sort Atkinson, Elizabeth G.
collection PubMed
description Large-scale high-throughput sequencing data sets have been transformative for informing clinical variant interpretation and for use as reference panels for statistical and population genetic efforts. Although such resources are often treated as ground truth, we find that in widely used reference data sets such as the Genome Aggregation Database (gnomAD), some variants pass gold-standard filters, yet are systematically different in their genotype calls across genotype discovery approaches. The inclusion of such discordant sites in study designs involving multiple genotype discovery strategies could bias results and lead to false-positive hits in association studies owing to technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across genotype discovery approaches, characterize the error mode of wrong calls, provide a list of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and machine learning classifier trained on gnomAD data to identify likely discordant variants in other data sets. We find that different genotype discovery approaches have different sets of variants at which this problem occurs, but there are characteristic variant features that can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, although different populations are powered for the discovery of different variants. We find that the most common error mode is that of a variant being heterozygous for one approach and homozygous for the other, with heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls.
format Online
Article
Text
id pubmed-10519400
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-10519400 2023-12-01 Discordant calls across genotype discovery approaches elucidate variants with systematic errors Atkinson, Elizabeth G. Artomov, Mykyta Loboda, Alexander A. Rehm, Heidi L. MacArthur, Daniel G. Karczewski, Konrad J. Neale, Benjamin M. Daly, Mark J. Genome Res Resource Large-scale high-throughput sequencing data sets have been transformative for informing clinical variant interpretation and for use as reference panels for statistical and population genetic efforts. Although such resources are often treated as ground truth, we find that in widely used reference data sets such as the Genome Aggregation Database (gnomAD), some variants pass gold-standard filters, yet are systematically different in their genotype calls across genotype discovery approaches. The inclusion of such discordant sites in study designs involving multiple genotype discovery strategies could bias results and lead to false-positive hits in association studies owing to technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across genotype discovery approaches, characterize the error mode of wrong calls, provide a list of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and machine learning classifier trained on gnomAD data to identify likely discordant variants in other data sets. We find that different genotype discovery approaches have different sets of variants at which this problem occurs, but there are characteristic variant features that can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, although different populations are powered for the discovery of different variants. We find that the most common error mode is that of a variant being heterozygous for one approach and homozygous for the other, with heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls. Cold Spring Harbor Laboratory Press 2023-06 /pmc/articles/PMC10519400/ /pubmed/37253541 http://dx.doi.org/10.1101/gr.277908.123 Text en © 2023 Atkinson et al.; Published by Cold Spring Harbor Laboratory Press https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title Discordant calls across genotype discovery approaches elucidate variants with systematic errors
topic Resource
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10519400/
https://www.ncbi.nlm.nih.gov/pubmed/37253541
http://dx.doi.org/10.1101/gr.277908.123