
Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets

It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect sys...


Bibliographic Details
Main authors: Mafessoni, Fabrizio; Prasad, Rashmi B; Groop, Leif; Hansson, Ola; Prüfer, Kay
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6185451/
https://www.ncbi.nlm.nih.gov/pubmed/30204860
http://dx.doi.org/10.1093/gbe/evy199
_version_ 1783362741017247744
author Mafessoni, Fabrizio
Prasad, Rashmi B
Groop, Leif
Hansson, Ola
Prüfer, Kay
author_sort Mafessoni, Fabrizio
collection PubMed
description It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined data sets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes data set, we find that coding regions are enriched for errors, where ∼1% of the higher frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a data set that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large data sets; our observation is thus not specific to the 1000 Genomes data set. Our results show that batch effects can be turned into a virtue by using the resulting variation in large scale data sets to detect systematic errors.
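The idea in the description, counting variant pairs that sit on different chromosomes yet co-occur in the same individuals, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_excess` threshold, the observed/expected ratio used as the co-occurrence criterion, and the toy data are all assumptions made for illustration only.

```python
from itertools import combinations


def cross_chromosome_pairs(genotypes, chromosomes, n_individuals, min_excess=2.0):
    """Flag variant pairs on different chromosomes that co-occur in individuals
    more often than expected if their carriers were independent.

    genotypes:     dict variant id -> set of individuals carrying the variant
    chromosomes:   dict variant id -> chromosome label
    n_individuals: total number of individuals in the data set
    min_excess:    hypothetical cutoff on observed / expected co-occurrence
    """
    flagged = []
    for a, b in combinations(sorted(genotypes), 2):
        if chromosomes[a] == chromosomes[b]:
            continue  # physical linkage can explain same-chromosome pairs
        observed = len(genotypes[a] & genotypes[b])
        expected = len(genotypes[a]) * len(genotypes[b]) / n_individuals
        if observed > 0 and observed / expected >= min_excess:
            flagged.append((a, b))
    return flagged


def pairs_per_individual(genotypes, flagged):
    """Count, for each individual, how many flagged pairs they carry."""
    counts = {}
    for a, b in flagged:
        for individual in genotypes[a] & genotypes[b]:
            counts[individual] = counts.get(individual, 0) + 1
    return counts


# Toy data (invented): 8 individuals, 4 variants.
genotypes = {
    "v1": {"i1", "i2"},
    "v2": {"i1", "i2"},
    "v3": {"i3", "i4", "i5", "i6"},
    "v4": {"i1", "i2", "i7"},
}
chromosomes = {"v1": "chr1", "v2": "chr2", "v3": "chr3", "v4": "chr1"}

flagged = cross_chromosome_pairs(genotypes, chromosomes, n_individuals=8)
counts = pairs_per_individual(genotypes, flagged)
```

In this toy example the individuals carrying unusually many flagged pairs (here `i1` and `i2`) would be the ones to inspect for batch-specific artifacts; a real analysis would need a proper statistical test and correction for population structure.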
format Online
Article
Text
id pubmed-6185451
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6185451 2018-10-18 Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets Mafessoni, Fabrizio Prasad, Rashmi B Groop, Leif Hansson, Ola Prüfer, Kay Genome Biol Evol Gen Res Oxford University Press 2018-09-10 /pmc/articles/PMC6185451/ /pubmed/30204860 http://dx.doi.org/10.1093/gbe/evy199 Text en © The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets
topic Gen Res
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6185451/
https://www.ncbi.nlm.nih.gov/pubmed/30204860
http://dx.doi.org/10.1093/gbe/evy199