Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions
MOTIVATION: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include cop...
Main authors: | Wu, Steven H; Schwartz, Rachel S; Winter, David J; Conrad, Donald F; Cartwright, Reed A |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2017 |
Subjects: | Original Papers |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860108/ https://www.ncbi.nlm.nih.gov/pubmed/28334373 http://dx.doi.org/10.1093/bioinformatics/btx133 |
_version_ | 1783307947351212032 |
---|---|
author | Wu, Steven H; Schwartz, Rachel S; Winter, David J; Conrad, Donald F; Cartwright, Reed A |
author_facet | Wu, Steven H; Schwartz, Rachel S; Winter, David J; Conrad, Donald F; Cartwright, Reed A |
author_sort | Wu, Steven H |
collection | PubMed |
description | MOTIVATION: Accurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. RESULTS: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls. AVAILABILITY AND IMPLEMENTATION: Methods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858). SUPPLEMENTARY INFORMATION: Supplementary data is available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-5860108 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5860108 2018-03-23 Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions. Wu, Steven H; Schwartz, Rachel S; Winter, David J; Conrad, Donald F; Cartwright, Reed A. Bioinformatics, Original Papers. Oxford University Press 2017-08-01 2017-03-15 /pmc/articles/PMC5860108/ /pubmed/28334373 http://dx.doi.org/10.1093/bioinformatics/btx133 Text en © The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Original Papers Wu, Steven H Schwartz, Rachel S Winter, David J Conrad, Donald F Cartwright, Reed A Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions |
title | Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions |
title_full | Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions |
title_fullStr | Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions |
title_full_unstemmed | Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions |
title_short | Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions |
title_sort | estimating error models for whole genome sequencing using mixtures of dirichlet-multinomial distributions |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860108/ https://www.ncbi.nlm.nih.gov/pubmed/28334373 http://dx.doi.org/10.1093/bioinformatics/btx133 |
work_keys_str_mv | AT wustevenh estimatingerrormodelsforwholegenomesequencingusingmixturesofdirichletmultinomialdistributions AT schwartzrachels estimatingerrormodelsforwholegenomesequencingusingmixturesofdirichletmultinomialdistributions AT winterdavidj estimatingerrormodelsforwholegenomesequencingusingmixturesofdirichletmultinomialdistributions AT conraddonaldf estimatingerrormodelsforwholegenomesequencingusingmixturesofdirichletmultinomialdistributions AT cartwrightreeda estimatingerrormodelsforwholegenomesequencingusingmixturesofdirichletmultinomialdistributions |
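
The abstract describes read counts at each site as following a mixture of Dirichlet-multinomial distributions, with a tight major component and an overdispersed minor component. The following is a minimal sketch of that idea, not the authors' implementation (their code is at the GitHub link in the description). The function names `dm_log_pmf` and `responsibilities`, the three count categories (reference, alternate, error) and all parameter values are assumptions chosen only for illustration.

```python
from math import exp, lgamma, log


def dm_log_pmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts per category.
    alpha:  positive concentration parameters, same length as counts.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient and the normalizing ratio of gamma functions.
    out = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        out += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return out


def responsibilities(counts, weights, alphas):
    """Posterior probability that a site belongs to each mixture component."""
    logs = [log(w) + dm_log_pmf(counts, al) for w, al in zip(weights, alphas)]
    m = max(logs)  # subtract the maximum for numerical stability
    total = sum(exp(v - m) for v in logs)
    return [exp(v - m) / total for v in logs]


# Hypothetical two-component model for a heterozygous site.
# Categories: (reference reads, alternate reads, erroneous reads).
weights = [0.9, 0.1]            # assumed mixing proportions
alphas = [
    [500.0, 490.0, 10.0],       # "major": near-binomial, low error and bias
    [5.0, 3.0, 0.5],            # "minor": overdispersed, more error and bias
]

site = [20, 19, 1]              # illustrative read counts at one site
post = responsibilities(site, weights, alphas)
print("P(major), P(minor) =", post)
```

Under this sketch, a filtering step like the one the abstract mentions would keep only sites whose posterior weight on the major component exceeds some chosen threshold. The threshold, like the parameters above, would have to be estimated or chosen from real data.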