GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment
Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that...
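Main Authors: | Zhang, Zhaojun; Huang, Shunping; Wang, Jack; Zhang, Xiang; Pardo Manuel de Villena, Fernando; McMillan, Leonard; Wang, Wei |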
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2013 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694649/ https://www.ncbi.nlm.nih.gov/pubmed/23812996 http://dx.doi.org/10.1093/bioinformatics/btt216 |
_version_ | 1782274881126137856 |
---|---|
author | Zhang, Zhaojun Huang, Shunping Wang, Jack Zhang, Xiang Pardo Manuel de Villena, Fernando McMillan, Leonard Wang, Wei |
author_sort | Zhang, Zhaojun |
collection | PubMed |
description | Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives) and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that ∼3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ∼10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries. Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measurement. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that >16.3% of them are false positives.
Availability: The software can be downloaded at http://csbio.unc.edu/genescissors/ Contact: weiwang@cs.ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-3694649 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-3694649 2013-06-27 Bioinformatics Oxford University Press 2013-07-01 2013-06-19 Text en © The Author 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | GeneScissors: a comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment |
topic | Ismb/Eccb 2013 Proceedings Papers Committee July 21 to July 23, 2013, Berlin, Germany |