
Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data

BACKGROUND: Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequenc...


Bibliographic Details
Main authors: Dozmorov, Mikhail G; Adrianto, Indra; Giles, Cory B; Glass, Edmund; Glenn, Stuart B; Montgomery, Courtney; Sivils, Kathy L; Olson, Lorin E; Iwayama, Tomoaki; Freeman, Willard M; Lessard, Christopher J; Wren, Jonathan D
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597324/
https://www.ncbi.nlm.nih.gov/pubmed/26423047
http://dx.doi.org/10.1186/1471-2105-16-S13-S10
author Dozmorov, Mikhail G
Adrianto, Indra
Giles, Cory B
Glass, Edmund
Glenn, Stuart B
Montgomery, Courtney
Sivils, Kathy L
Olson, Lorin E
Iwayama, Tomoaki
Freeman, Willard M
Lessard, Christopher J
Wren, Jonathan D
author_sort Dozmorov, Mikhail G
collection PubMed
description BACKGROUND: Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments. METHODS: We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data on the same patient samples to determine whether changes in pre-processing improved correlation between the two. We have developed a simple tool to remove low complexity regions, RepeatSoaker, available at https://github.com/mdozmorov/RepeatSoaker, and tested its effect on the alignment statistics and the results of the enrichment analyses. RESULTS: Both adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping with low complexity regions, as defined by RepeatMasker, further improved the strength of biological signals, and the correlation between RNA-seq and microarray gene expression data. CONCLUSIONS: Adapter trimming and duplicates removal, coupled with filtering out reads overlapping low complexity regions, is shown to increase the quality and reliability of detecting biological signals in RNA-seq and ChIP-seq data.
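The filtering step named in the abstract (discarding reads that overlap low complexity regions) can be illustrated with a minimal sketch. This is not RepeatSoaker's implementation. The function names, the tuple layout for reads, and the example coordinates are all hypothetical. Coordinates are assumed to be integer, 0-based, half-open intervals, and the masked regions are assumed to come from an annotation such as RepeatMasker output.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_any(merged, starts, start, end):
    """Return True if [start, end) overlaps any of the merged intervals."""
    # Index of the last merged interval that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    # Merged intervals are disjoint and sorted, so only that one can overlap.
    return i >= 0 and merged[i][1] > start


def filter_reads(reads, masked):
    """Keep reads that overlap no masked region.

    reads:  iterable of (name, chrom, start, end) tuples (hypothetical layout)
    masked: dict mapping chrom -> list of (start, end) masked intervals
    """
    index = {}
    for chrom, intervals in masked.items():
        merged = merge_intervals(intervals)
        index[chrom] = (merged, [s for s, _ in merged])

    kept = []
    for name, chrom, start, end in reads:
        if chrom in index:
            merged, starts = index[chrom]
            if overlaps_any(merged, starts, start, end):
                continue  # read touches a masked region: drop it
        kept.append((name, chrom, start, end))
    return kept


# Made-up example data, for illustration only.
masked_regions = {"chr1": [(100, 200), (150, 250), (1000, 1100)]}
example_reads = [
    ("r1", "chr1", 50, 100),     # ends exactly where the masked region begins
    ("r2", "chr1", 90, 126),     # overlaps the first masked region
    ("r3", "chr1", 300, 336),    # between masked regions
    ("r4", "chr1", 1099, 1135),  # overlaps the last masked region by one base
    ("r5", "chr2", 100, 136),    # chromosome with no masked regions
]
kept_reads = filter_reads(example_reads, masked_regions)
print([name for name, *_ in kept_reads])
```

Dropping a read on any overlap, even a single base, is one possible reading of the "aggressive" filtering the abstract mentions. A real pipeline would more likely apply an established interval tool to aligned BAM records than use hand-written code like this.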
format Online
Article
Text
id pubmed-4597324
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4597324 2015-10-08 Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data. Dozmorov, Mikhail G; Adrianto, Indra; Giles, Cory B; Glass, Edmund; Glenn, Stuart B; Montgomery, Courtney; Sivils, Kathy L; Olson, Lorin E; Iwayama, Tomoaki; Freeman, Willard M; Lessard, Christopher J; Wren, Jonathan D. BMC Bioinformatics. Proceedings. BioMed Central 2015-09-25 /pmc/articles/PMC4597324/ /pubmed/26423047 http://dx.doi.org/10.1186/1471-2105-16-S13-S10 Text en Copyright © 2015 Dozmorov et al. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data
topic Proceedings