Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
BACKGROUND: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of sys...
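Main authors: | Weißbach, Stephan; Sys, Stanislav; Hewel, Charlotte; Todorov, Hristo; Schweiger, Susann; Winter, Jennifer; Pfenninger, Markus; Torkamani, Ali; Evans, Doug; Burger, Joachim; Everschor-Sitte, Karin; May-Simera, Helen Louise; Gerber, Susanne |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7814447/ https://www.ncbi.nlm.nih.gov/pubmed/33468057 http://dx.doi.org/10.1186/s12864-020-07362-8 |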
_version_ | 1783638057678798848 |
---|---|
author | Weißbach, Stephan; Sys, Stanislav; Hewel, Charlotte; Todorov, Hristo; Schweiger, Susann; Winter, Jennifer; Pfenninger, Markus; Torkamani, Ali; Evans, Doug; Burger, Joachim; Everschor-Sitte, Karin; May-Simera, Helen Louise; Gerber, Susanne |
author_facet | Weißbach, Stephan; Sys, Stanislav; Hewel, Charlotte; Todorov, Hristo; Schweiger, Susann; Winter, Jennifer; Pfenninger, Markus; Torkamani, Ali; Evans, Doug; Burger, Joachim; Everschor-Sitte, Karin; May-Simera, Helen Louise; Gerber, Susanne |
author_sort | Weißbach, Stephan |
collection | PubMed |
description | BACKGROUND: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform’s impact. RESULTS: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. CONCLUSION: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-020-07362-8. |
format | Online Article Text |
id | pubmed-7814447 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-78144472021-01-19 Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines Weißbach, Stephan Sys, Stanislav Hewel, Charlotte Todorov, Hristo Schweiger, Susann Winter, Jennifer Pfenninger, Markus Torkamani, Ali Evans, Doug Burger, Joachim Everschor-Sitte, Karin May-Simera, Helen Louise Gerber, Susanne BMC Genomics Research Article BACKGROUND: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform’s impact. RESULTS: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. CONCLUSION: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-020-07362-8. BioMed Central 2021-01-19 /pmc/articles/PMC7814447/ /pubmed/33468057 http://dx.doi.org/10.1186/s12864-020-07362-8 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Weißbach, Stephan Sys, Stanislav Hewel, Charlotte Todorov, Hristo Schweiger, Susann Winter, Jennifer Pfenninger, Markus Torkamani, Ali Evans, Doug Burger, Joachim Everschor-Sitte, Karin May-Simera, Helen Louise Gerber, Susanne Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines |
title | Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7814447/ https://www.ncbi.nlm.nih.gov/pubmed/33468057 http://dx.doi.org/10.1186/s12864-020-07362-8 |
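
The abstract above refers to concordance between setups and to variants called uniquely by a single setup. A minimal sketch of how such quantities could be computed for one individual is given below. It assumes each setup's calls are available as a set of (chromosome, position, ref, alt) tuples and uses the Jaccard index as the concordance measure; the record does not state which metric the authors used, and the setup names and variants in `calls` are invented toy data, not results from the study.

```python
from itertools import combinations

# Hypothetical toy call sets for one individual; each variant is
# (chromosome, position, reference allele, alternate allele).
calls = {
    "hiseq_x": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")},
    "hiseq": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "complete_genomics": {("chr1", 100, "A", "G"), ("chr3", 10, "T", "C")},
}


def jaccard(a, b):
    """Shared variants divided by all variants seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_concordance(call_sets):
    """Jaccard index for every pair of setups (assumed metric, see lead-in)."""
    return {
        (x, y): jaccard(call_sets[x], call_sets[y])
        for x, y in combinations(sorted(call_sets), 2)
    }


def unique_to_setup(call_sets):
    """Variants called by exactly one setup and by none of the others."""
    result = {}
    for name, variants in call_sets.items():
        others = set().union(*(v for n, v in call_sets.items() if n != name))
        result[name] = variants - others
    return result


if __name__ == "__main__":
    for pair, score in pairwise_concordance(calls).items():
        print(pair, round(score, 3))
    for name, variants in unique_to_setup(calls).items():
        print(name, sorted(variants))
```

Splitting the input sets by variant class (SNP versus indel) before calling `pairwise_concordance` would give the kind of per-class comparison the abstract describes.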