
A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments

BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existin...


Bibliographic Details
Main Author:	Bansal, Vikas
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5374682/ https://www.ncbi.nlm.nih.gov/pubmed/28361665 http://dx.doi.org/10.1186/s12859-017-1471-9

author	Bansal, Vikas
collection	PubMed
description	BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from “natural” read duplicates that represent independent DNA fragments and therefore overestimate the PCR duplication rate for DNA-seq and RNA-seq experiments. RESULTS: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets that contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates, likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. CONCLUSIONS: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1471-9) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5374682
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5374682 2017-04-03 A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments Bansal, Vikas BMC Bioinformatics Research BioMed Central 2017-03-14 /pmc/articles/PMC5374682/ /pubmed/28361665 http://dx.doi.org/10.1186/s12859-017-1471-9 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments
topic	Research
