
GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that...


Bibliographic Details
Main authors: Zhang, Zhaojun; Huang, Shunping; Wang, Jack; Zhang, Xiang; Pardo Manuel de Villena, Fernando; McMillan, Leonard; Wang, Wei
Format: Online Article Text
Language: English
Published: Oxford University Press 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694649/
https://www.ncbi.nlm.nih.gov/pubmed/23812996
http://dx.doi.org/10.1093/bioinformatics/btt216
_version_ 1782274881126137856
author Zhang, Zhaojun
Huang, Shunping
Wang, Jack
Zhang, Xiang
Pardo Manuel de Villena, Fernando
McMillan, Leonard
Wang, Wei
author_sort Zhang, Zhaojun
collection PubMed
description Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spuriously expressed genes (false positives) and missing truly expressed genes (false negatives). Such errors affect subsequent analyses, such as differential expression analysis. In our study, we observe that ∼3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ∼10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.
Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measurement. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts than the TopHat/Cufflinks pipeline. In addition, among the ∼10.0% of transcripts reported by TopHat/Cufflinks that are unannotated, GeneScissors finds that >16.3% are false positives.
Availability: The software can be downloaded at http://csbio.unc.edu/genescissors/
Contact: weiwang@cs.ucla.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3694649
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-36946492013-06-27 GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment Zhang, Zhaojun Huang, Shunping Wang, Jack Zhang, Xiang Pardo Manuel de Villena, Fernando McMillan, Leonard Wang, Wei Bioinformatics Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that ∼3.5% of transcripts reported by TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ∼10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries. Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. 
It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measurement. On real data, GeneScissors reports 53.6% less pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% of them are false positives. Availability: The software can be downloaded at http://csbio.unc.edu/genescissors/ Contact: weiwang@cs.ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2013-07-01 2013-06-19 /pmc/articles/PMC3694649/ /pubmed/23812996 http://dx.doi.org/10.1093/bioinformatics/btt216 Text en © The Author 2013. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
Zhang, Zhaojun
Huang, Shunping
Wang, Jack
Zhang, Xiang
Pardo Manuel de Villena, Fernando
McMillan, Leonard
Wang, Wei
GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment
title GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment
title_sort genescissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to rna-seq reads misalignment
topic Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694649/
https://www.ncbi.nlm.nih.gov/pubmed/23812996
http://dx.doi.org/10.1093/bioinformatics/btt216