Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities

We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts...

Bibliographic Details
Main Authors: Kaisers, Wolfgang; Schwender, Holger; Schaal, Heiner
Format: Online Article Text
Language: English
Published: MDPI 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6274891/
https://www.ncbi.nlm.nih.gov/pubmed/30469355
http://dx.doi.org/10.3390/ijms19113687
author Kaisers, Wolfgang
Schwender, Holger
Schaal, Heiner
collection PubMed
description We apply hierarchical clustering (HC) of DNA k-mer counts to multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, whereas clustering by preparation group indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. To provide a simple, readily applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicated the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides a nonspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
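
The idea summarized in the description can be illustrated with a minimal Python sketch. It is not the seqTools implementation (seqTools is an R package), and several choices are assumptions made only for illustration: the Canberra distance on relative k-mer frequencies, average linkage, the toy reads, and all function names (`fastq_sequences`, `kmer_profile`, `canberra`, `average_linkage`, `to_fastq`) are hypothetical. The sketch reads sequences one record at a time from 4-line Fastq text, builds one k-mer profile per sample, and merges samples agglomeratively.

```python
from itertools import product


def fastq_sequences(text):
    """Return the sequence lines of uncompressed, 4-line-per-record Fastq text."""
    return text.strip().splitlines()[1::4]


def kmer_profile(reads, k=2):
    """Relative frequency of every DNA k-mer, in a fixed order, over all reads."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:  # k-mers containing N or other symbols are skipped
                counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]


def canberra(p, q):
    """Canberra distance (assumed metric); zero-denominator terms contribute 0."""
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b > 0)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    names = sorted(profiles)
    dist = {frozenset((a, b)): canberra(profiles[a], profiles[b])
            for a in names for b in names if a < b}

    def between(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        height, c1, c2 = min((between(c1, c2), c1, c2)
                             for i, c1 in enumerate(clusters)
                             for c2 in clusters[i + 1:])
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
        merges.append((c1, c2, height))
    return merges


def to_fastq(reads):
    """Wrap toy reads in minimal Fastq records (constant placeholder qualities)."""
    return "".join(f"@r{i}\n{s}\n+\n{'I' * len(s)}\n" for i, s in enumerate(reads))


# Toy data (invented): two AT-rich and two GC-rich "samples".
samples = {
    "A1": to_fastq(["ATATATATAT", "AATTAATTAA"]),
    "A2": to_fastq(["ATATATATTA", "AATTAATTAT"]),
    "B1": to_fastq(["GCGCGCGCGC", "GGCCGGCCGG"]),
    "B2": to_fastq(["GCGCGCGCCG", "GGCCGGCCGC"]),
}
profiles = {name: kmer_profile(fastq_sequences(text)) for name, text in samples.items()}
merges = average_linkage(profiles)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy input the two AT-rich samples and the two GC-rich samples merge first, and the last merge joins the two groups. In a real experiment, whether such a split follows experimental groups or preparation groups is what would distinguish an experimental effect from a batch effect.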
format Online
Article
Text
id pubmed-6274891
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6274891
2018-12-15
Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities
Kaisers, Wolfgang
Schwender, Holger
Schaal, Heiner
Int J Mol Sci
Article
MDPI
2018-11-21
/pmc/articles/PMC6274891/
/pubmed/30469355
http://dx.doi.org/10.3390/ijms19113687
Text
en
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities
topic Article