A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments
BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existin...
Main author: | Bansal, Vikas |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374682/ https://www.ncbi.nlm.nih.gov/pubmed/28361665 http://dx.doi.org/10.1186/s12859-017-1471-9 |
_version_ | 1782518942161436672 |
---|---|
author | Bansal, Vikas |
author_facet | Bansal, Vikas |
author_sort | Bansal, Vikas |
collection | PubMed |
description | BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from “natural” read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. RESULTS: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. CONCLUSIONS: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1471-9) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5374682 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5374682 2017-04-03 A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments Bansal, Vikas BMC Bioinformatics Research BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from “natural” read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. RESULTS: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. CONCLUSIONS: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1471-9) contains supplementary material, which is available to authorized users. BioMed Central 2017-03-14 /pmc/articles/PMC5374682/ /pubmed/28361665 http://dx.doi.org/10.1186/s12859-017-1471-9 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Bansal, Vikas A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments |
title | A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments |
title_full | A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments |
title_fullStr | A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments |
title_full_unstemmed | A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments |
title_short | A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments |
title_sort | computational method for estimating the pcr duplication rate in dna and rna-seq experiments |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374682/ https://www.ncbi.nlm.nih.gov/pubmed/28361665 http://dx.doi.org/10.1186/s12859-017-1471-9 |
work_keys_str_mv | AT bansalvikas acomputationalmethodforestimatingthepcrduplicationrateindnaandrnaseqexperiments AT bansalvikas computationalmethodforestimatingthepcrduplicationrateindnaandrnaseqexperiments |
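
The abstract in this record says the method separates PCR duplicates from natural duplicates by using heterozygous variants. The intuition can be illustrated with a deliberately simplified sketch for duplicate read pairs. This is not the author's implementation (that is at the GitHub URL given in the description), and the function name `estimate_pcr_fraction` and the pair-based estimator are assumptions made here for illustration. The sketch assumes that two independent fragments covering a heterozygous site carry different alleles with probability 0.5, that PCR copies always carry the same allele, and that sequencing errors and allelic bias can be ignored.

```python
def estimate_pcr_fraction(same_allele_pairs: int, diff_allele_pairs: int) -> float:
    """Estimate the fraction of duplicate pairs that are PCR duplicates.

    Illustrative assumptions (not taken from the paper):
      * every counted duplicate pair overlaps one heterozygous site;
      * natural (independent) duplicates differ in allele half the time;
      * PCR duplicates never differ in allele;
      * no sequencing errors and no allelic imbalance.
    """
    total = same_allele_pairs + diff_allele_pairs
    if total == 0:
        raise ValueError("no duplicate pairs overlap a heterozygous site")

    # Observed share of pairs whose two reads show different alleles.
    frac_diff = diff_allele_pairs / total

    # Only natural duplicates can differ, and they do so half the time,
    # so the natural share is about twice the observed share (capped at 1).
    frac_natural = min(1.0, 2.0 * frac_diff)

    return 1.0 - frac_natural


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(estimate_pcr_fraction(same_allele_pairs=75, diff_allele_pairs=25))
```

Under these assumptions, a dataset in which no duplicate pair ever shows two different alleles would be attributed entirely to PCR duplication, while one in which half of the pairs differ would be attributed entirely to natural duplicates. The 0.5 assumption is weaker for RNA-seq, where allele-specific expression can skew allele frequencies.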