Substantial batch effects in TCGA exome sequences undermine pan-cancer analysis of germline variants

BACKGROUND: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variants frequencies in healthy individuals an...

Bibliographic Details
Main authors: Rasnic, Roni; Brandes, Nadav; Zuk, Or; Linial, Michal
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6686424/
https://www.ncbi.nlm.nih.gov/pubmed/31391007
http://dx.doi.org/10.1186/s12885-019-5994-5
author Rasnic, Roni
Brandes, Nadav
Zuk, Or
Linial, Michal
collection PubMed
description BACKGROUND: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variants frequencies in healthy individuals and cancer patients is the basis for seeking candidates for cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from > 10,000 patients.
METHODS: Our hypothesis in this study is that whole exome sequences from blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity.
RESULTS: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most of known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants.
CONCLUSION: TCGA germline data is exposed to strong batch effects with substantial variabilities among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability and the potency to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12885-019-5994-5) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6686424
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6686424 2019-08-12 Substantial batch effects in TCGA exome sequences undermine pan-cancer analysis of germline variants Rasnic, Roni Brandes, Nadav Zuk, Or Linial, Michal BMC Cancer Research Article BioMed Central 2019-08-07 /pmc/articles/PMC6686424/ /pubmed/31391007 http://dx.doi.org/10.1186/s12885-019-5994-5 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Substantial batch effects in TCGA exome sequences undermine pan-cancer analysis of germline variants
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6686424/
https://www.ncbi.nlm.nih.gov/pubmed/31391007
http://dx.doi.org/10.1186/s12885-019-5994-5