
A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments

BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existin...


Bibliographic Details
Main author: Bansal, Vikas
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374682/
https://www.ncbi.nlm.nih.gov/pubmed/28361665
http://dx.doi.org/10.1186/s12859-017-1471-9
collection PubMed
description BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from “natural” read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. RESULTS: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. CONCLUSIONS: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1471-9) contains supplementary material, which is available to authorized users.
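The idea of separating PCR duplicates from natural duplicates at heterozygous variants can be illustrated with a minimal sketch. It is not the implementation from the linked repository; the function name `estimate_pcr_fraction` and the input format are hypothetical, and the model rests on simplifying assumptions: only duplicate clusters of exactly two reads are considered, sequencing errors are ignored, and both alleles at a heterozygous site are assumed to be sampled equally often. Under these assumptions, two PCR copies of one fragment always carry the same allele, while two independent fragments carry the same allele half of the time. If f is the fraction of duplicate pairs that are PCR duplicates, the expected fraction of pairs with differing alleles is d = (1 - f) / 2, so f = 1 - 2d.

```python
def estimate_pcr_fraction(allele_pairs):
    """Estimate the fraction of duplicate read pairs that are PCR duplicates.

    allele_pairs: iterable of (allele_1, allele_2) tuples, one tuple per
    duplicate pair that overlaps a heterozygous site (hypothetical input
    format chosen for this illustration).
    """
    pairs = list(allele_pairs)
    if not pairs:
        raise ValueError("need at least one duplicate pair at a heterozygous site")

    # Pairs whose two reads show different alleles must come from
    # independent fragments (assuming no sequencing errors).
    discordant = sum(1 for a, b in pairs if a != b)
    d = discordant / len(pairs)

    # Natural duplicates are discordant half of the time, so d = (1 - f) / 2.
    f = 1.0 - 2.0 * d

    # Sampling noise can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, f))


# Toy data: 8 concordant pairs and 2 discordant pairs.
pairs = [("A", "A")] * 6 + [("G", "G")] * 2 + [("A", "G")] * 2
pcr_fraction = estimate_pcr_fraction(pairs)
natural_fraction = 1.0 - pcr_fraction
```

With the toy data, 2 of 10 pairs are discordant, so the estimate is about 0.6 for PCR duplicates and 0.4 for natural duplicates. A real analysis would also have to handle larger duplicate clusters, sequencing errors, and allelic imbalance, which this sketch leaves out.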
id pubmed-5374682
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5374682 2017-04-03 Bansal, Vikas BMC Bioinformatics Research BioMed Central 2017-03-14 /pmc/articles/PMC5374682/ /pubmed/28361665 http://dx.doi.org/10.1186/s12859-017-1471-9 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Research