
Pan-cancer analysis of systematic batch effects on somatic sequence variations

BACKGROUND: The Cancer Genome Atlas (TCGA) is a comprehensive database that includes multi-layered cancer genome profiles. Large-scale collection of data inevitably generates batch effects introduced by differences in processing at various stages from sample collection to data generation. However, b...


Bibliographic Details
Main Authors: Choi, Ji-Hye, Hong, Seong-Eui, Woo, Hyun Goo
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5387285/
https://www.ncbi.nlm.nih.gov/pubmed/28399795
http://dx.doi.org/10.1186/s12859-017-1627-7
author Choi, Ji-Hye
Hong, Seong-Eui
Woo, Hyun Goo
collection PubMed
description BACKGROUND: The Cancer Genome Atlas (TCGA) is a comprehensive database that includes multi-layered cancer genome profiles. Large-scale collection of data inevitably generates batch effects introduced by differences in processing at various stages from sample collection to data generation. However, batch effects on the sequence variation and its characteristics have not been studied extensively. RESULTS: We systematically evaluated batch effects on somatic sequence variations in pan-cancer TCGA data, revealing 999 somatic variants that were batch-biased with statistical significance (P < 0.00001, Fisher’s exact test, false discovery rate ≤ 0.0027). Most of the batch-biased variants were associated with specific sample plates. The batch-biased variants, which had a unique mutational spectrum with frequent indel-type mutations, preferentially occurred at sites prone to sequencing errors, e.g., in long homopolymer runs. Non-indel type batch-biased variants were frequent at splicing sites with the unique consensus motif sequence ‘TTDTTTAGTT’. Furthermore, some batch-biased variants occur in known cancer genes, potentially causing misinterpretation of mutation profiles. CONCLUSIONS: Our strategy for identifying batch-biased variants and characterising sequence patterns might be useful in eliminating false variants and facilitating correct interpretation of sequence profiles. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1627-7) contains supplementary material, which is available to authorized users.
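The batch-bias test named in the description can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a one-sided Fisher's exact test for enrichment of a variant within one sample plate and a Benjamini-Hochberg adjustment for the false discovery rate. The function names, the layout of `counts`, and the variant and plate labels are hypothetical, and the counts are toy numbers. Only the P < 0.00001 cutoff is taken from the record.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]].

    a: samples on the plate that carry the variant
    b: samples on the plate that do not
    c: samples on other plates that carry the variant
    d: samples on other plates that do not
    """
    plate_size = a + b
    carriers = a + c
    total = a + b + c + d
    upper = min(plate_size, carriers)
    # Hypergeometric tail: at least `a` carriers land on this plate by chance.
    tail = sum(
        comb(carriers, k) * comb(total - carriers, plate_size - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, plate_size)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def batch_biased(counts, alpha=1e-5):
    """Flag (variant, plate) pairs whose raw p-value is below `alpha`.

    counts maps (variant, plate) to the 2x2 cell counts (a, b, c, d).
    Returns {(variant, plate): (p_value, adjusted_p_value)}.
    """
    keys = list(counts)
    pvals = [fisher_exact_greater(*counts[k]) for k in keys]
    qvals = benjamini_hochberg(pvals)
    return {k: (p, q) for k, p, q in zip(keys, pvals, qvals) if p < alpha}


# Toy input (hypothetical): one variant concentrated on a plate, one not.
toy_counts = {
    ("chr1:100:delT", "plate_A"): (18, 2, 1, 479),
    ("chr2:200:C>T", "plate_A"): (1, 19, 20, 460),
}

for key, (p, q) in batch_biased(toy_counts).items():
    print(key, f"p={p:.3g}", f"adjusted p={q:.3g}")
```

A one-sided test is used here because the question is whether a variant is over-represented on a plate. The record does not state which alternative hypothesis or which FDR procedure the authors used, so both choices are assumptions.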
format Online
Article
Text
id pubmed-5387285
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5387285 2017-04-11 Pan-cancer analysis of systematic batch effects on somatic sequence variations Choi, Ji-Hye Hong, Seong-Eui Woo, Hyun Goo BMC Bioinformatics Research Article BioMed Central 2017-04-11 /pmc/articles/PMC5387285/ /pubmed/28399795 http://dx.doi.org/10.1186/s12859-017-1627-7 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Pan-cancer analysis of systematic batch effects on somatic sequence variations
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5387285/
https://www.ncbi.nlm.nih.gov/pubmed/28399795
http://dx.doi.org/10.1186/s12859-017-1627-7
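The abstract in this record names two sequence contexts for batch-biased variants: long homopolymer runs and the consensus motif 'TTDTTTAGTT'. The sketch below shows one way such contexts could be looked up in a flanking sequence. It is an illustration under stated assumptions, not the authors' method: the function names are hypothetical, the minimum run length of 6 is an arbitrary placeholder, and 'D' is read as the standard IUPAC ambiguity code for A, G or T.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif such as 'TTDTTTAGTT' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(seq, motif="TTDTTTAGTT"):
    """Return 0-based start positions of every (possibly overlapping) match."""
    # A lookahead keeps the match zero-width so overlapping hits are reported.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]


def homopolymer_runs(seq, min_len=6):
    """Return (start, base, length) for each single-base run of >= min_len.

    min_len=6 is a placeholder threshold, not a value from the record.
    """
    # One base followed by at least (min_len - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


# Hypothetical flanking sequences, for illustration only.
print(find_motif("CCTTATTTAGTTGG"))
print(homopolymer_runs("ACGTTTTTTTAC"))
```

Converting the motif to a regular expression keeps the ambiguity handling in one table, so other IUPAC motifs can be searched without changing the code.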