Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

MOTIVATION: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include cop...

Bibliographic Details
Main authors: Wu, Steven H, Schwartz, Rachel S, Winter, David J, Conrad, Donald F, Cartwright, Reed A
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860108/
https://www.ncbi.nlm.nih.gov/pubmed/28334373
http://dx.doi.org/10.1093/bioinformatics/btx133
author Wu, Steven H
Schwartz, Rachel S
Winter, David J
Conrad, Donald F
Cartwright, Reed A
collection PubMed
description MOTIVATION: Accurate identification of genotypes is an essential part of the analysis of genomic data, including identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. RESULTS: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model consisted of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls. AVAILABILITY AND IMPLEMENTATION: Methods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
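As a rough illustration of the approach summarized in the description, the sketch below evaluates a two-component Dirichlet-multinomial mixture for the read counts at a single site and reports how strongly the site fits each component. It is not the authors' implementation (their code is in the repository linked above). The function names, the three count categories (reference, alternate, other) and every parameter value are assumptions chosen only for illustration.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial."""
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient: n! / prod(x_i!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma-function ratio obtained by integrating out the Dirichlet prior
    log_ratio = math.lgamma(a0) - math.lgamma(n + a0)
    log_ratio += sum(
        math.lgamma(x + a) - math.lgamma(a) for x, a in zip(counts, alpha)
    )
    return log_coef + log_ratio


def responsibilities(counts, weights, alphas):
    """Posterior probability that a site belongs to each mixture component."""
    terms = [math.log(w) + dm_log_pmf(counts, a) for w, a in zip(weights, alphas)]
    # Log-sum-exp keeps the normalization numerically stable
    m = max(terms)
    log_total = m + math.log(sum(math.exp(t - m) for t in terms))
    return [math.exp(t - log_total) for t in terms]


# Hypothetical parameters for a heterozygous site (ref, alt, other counts).
# Major component: tightly concentrated near a 50/50 split, low error.
# Minor component: small concentration (overdispersed) with reference bias.
WEIGHTS = [0.9, 0.1]
ALPHAS = [(500.0, 500.0, 1.0), (2.0, 1.0, 0.1)]

balanced = responsibilities((10, 10, 0), WEIGHTS, ALPHAS)
skewed = responsibilities((19, 1, 0), WEIGHTS, ALPHAS)
print("balanced site:", balanced)
print("skewed site:  ", skewed)
```

Under these assumed parameters, a site with balanced counts is attributed mostly to the major component, while a strongly skewed site is attributed mostly to the minor component. Filtering on the major-component responsibility is one way to express the idea of removing sites that do not fit the major component.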
format Online
Article
Text
id pubmed-5860108
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5860108 2018-03-23 Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions Wu, Steven H Schwartz, Rachel S Winter, David J Conrad, Donald F Cartwright, Reed A Bioinformatics Original Papers Oxford University Press 2017-08-01 2017-03-15 /pmc/articles/PMC5860108/ /pubmed/28334373 http://dx.doi.org/10.1093/bioinformatics/btx133 Text en © The Author 2017. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions
topic Original Papers