GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that...

Bibliographic Details
Main Authors:	Zhang, Zhaojun; Huang, Shunping; Wang, Jack; Zhang, Xiang; Pardo Manuel de Villena, Fernando; McMillan, Leonard; Wang, Wei
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2013
Subjects:	Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694649/ https://www.ncbi.nlm.nih.gov/pubmed/23812996 http://dx.doi.org/10.1093/bioinformatics/btt216

author	Zhang, Zhaojun Huang, Shunping Wang, Jack Zhang, Xiang Pardo Manuel de Villena, Fernando McMillan, Leonard Wang, Wei
collection	PubMed
description	Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base-pair-level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations in the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spuriously expressed genes (false positives) and missing genes that are actually expressed (false negatives). Such errors affect subsequent analyses, such as differential expression analysis. In our study, we observe that ∼3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ∼10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries. Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts than the TopHat/Cufflinks pipeline. In addition, among the ∼10.0% of transcripts reported by TopHat/Cufflinks that are unannotated, GeneScissors finds that >16.3% are false positives.
Availability: The software can be downloaded at http://csbio.unc.edu/genescissors/ Contact: weiwang@cs.ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-3694649
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3694649 2013-06-27 Bioinformatics Oxford University Press 2013-07-01 2013-06-19 /pmc/articles/PMC3694649/ /pubmed/23812996 http://dx.doi.org/10.1093/bioinformatics/btt216 Text en © The Author 2013. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment
topic	Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694649/ https://www.ncbi.nlm.nih.gov/pubmed/23812996 http://dx.doi.org/10.1093/bioinformatics/btt216