
Effect of method of deduplication on estimation of differential gene expression using RNA-seq

BACKGROUND: RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. RESULTS: To infer the influence of different methods of removal of duplicated reads on estimation of gene expre...


Bibliographic Details
Main Authors: Klepikova, Anna V., Kasianov, Artem S., Chesnokov, Mikhail S., Lazarevich, Natalia L., Penin, Aleksey A., Logacheva, Maria
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2017
Subjects: Genomics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5357343/
https://www.ncbi.nlm.nih.gov/pubmed/28321364
http://dx.doi.org/10.7717/peerj.3091
collection PubMed
description BACKGROUND: RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. RESULTS: To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes. CONCLUSION: The use of unique molecular identifiers greatly improves accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects results of differential gene expression analysis, producing a high proportion of false negative results. The absence of duplicate read removal is biased towards false positives. In those cases where using MI is not possible, we recommend using paired-end sequencing layout.
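The difference between deduplication with and without molecular indices described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' script set and not the algorithm implemented in SAMtools; the `Read` record, its field names, and the toy reads are hypothetical and exist only for illustration. It assumes single-end reads that are identified by chromosome, start position, and strand.

```python
from collections import namedtuple

# Hypothetical minimal read record: mapping coordinates plus a molecular index (MI).
Read = namedtuple("Read", ["chrom", "pos", "strand", "mi"])


def dedup_by_position(reads):
    """Keep the first read seen at each (chrom, pos, strand); the MI is ignored."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def dedup_by_mi(reads):
    """Keep the first read seen for each (chrom, pos, strand, mi) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.mi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Toy data (made up): two copies of one molecule, plus a second molecule
# that happens to map to the same position but carries a different MI.
reads = [
    Read("chr1", 100, "+", "ACGT"),
    Read("chr1", 100, "+", "ACGT"),  # same position, same MI: treated as a duplicate
    Read("chr1", 100, "+", "TTGA"),  # same position, different MI: distinct molecule
    Read("chr2", 500, "-", "GGCC"),
]

print(len(reads))                     # no deduplication
print(len(dedup_by_position(reads)))  # position-only deduplication
print(len(dedup_by_mi(reads)))        # MI-aware deduplication
```

In this toy input, position-only deduplication discards the read with MI `TTGA` even though it comes from a different molecule, while MI-aware deduplication keeps it. A plausible reading of the abstract is that such collisions become more frequent as more molecules of the same transcript are sequenced, which would be consistent with the reported effect on highly expressed genes.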
id pubmed-5357343
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5357343 2017-03-20 PeerJ Genomics PeerJ Inc. 2017-03-16 /pmc/articles/PMC5357343/ /pubmed/28321364 http://dx.doi.org/10.7717/peerj.3091 Text en ©2017 Klepikova et al.
http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
topic Genomics