Detection and classification of peaks in 5' cap RNA sequencing data
BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identif...
Main authors: | Strbenac, Dario; Armstrong, Nicola J; Yang, Jean YH |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852351/ https://www.ncbi.nlm.nih.gov/pubmed/24564843 http://dx.doi.org/10.1186/1471-2164-14-S5-S9 |
_version_ | 1782478654213718016 |
---|---|
author | Strbenac, Dario Armstrong, Nicola J Yang, Jean YH |
author_facet | Strbenac, Dario Armstrong, Nicola J Yang, Jean YH |
author_sort | Strbenac, Dario |
collection | PubMed |
description | BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations. RESULTS: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with Poisson null distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks which represent the initiation of transcription and peaks that do not. Comparison of classification performance to the best existing method based on whole genome segmentation showed comparable precision and improved recall. Internal features, which are intrinsic to the data and require no further experiments, had high precision and recall rates. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision. CONCLUSIONS: The Poisson sliding window model is an effective and fast way of taking the peak neighbourhood into account, and finding statistically significant peaks over a range of transcript expression values. It is orders of magnitude faster than doing whole genome segmentation. The support vector classification scheme has better precision and recall than existing methods. Integrating additional datasets is shown to provide minor gains in recall, in comparison to using only the cap-sequencing data. |
format | Online Article Text |
id | pubmed-3852351 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3852351 2013-12-19 Detection and classification of peaks in 5' cap RNA sequencing data Strbenac, Dario Armstrong, Nicola J Yang, Jean YH BMC Genomics Research BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations. RESULTS: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with Poisson null distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks which represent the initiation of transcription and peaks that do not. Comparison of classification performance to the best existing method based on whole genome segmentation showed comparable precision and improved recall. Internal features, which are intrinsic to the data and require no further experiments, had high precision and recall rates. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision. CONCLUSIONS: The Poisson sliding window model is an effective and fast way of taking the peak neighbourhood into account, and finding statistically significant peaks over a range of transcript expression values. It is orders of magnitude faster than doing whole genome segmentation. The support vector classification scheme has better precision and recall than existing methods. Integrating additional datasets is shown to provide minor gains in recall, in comparison to using only the cap-sequencing data. BioMed Central 2013-10-16 /pmc/articles/PMC3852351/ /pubmed/24564843 http://dx.doi.org/10.1186/1471-2164-14-S5-S9 Text en Copyright © 2013 Strbenac et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Strbenac, Dario Armstrong, Nicola J Yang, Jean YH Detection and classification of peaks in 5' cap RNA sequencing data |
title | Detection and classification of peaks in 5' cap RNA sequencing data |
title_full | Detection and classification of peaks in 5' cap RNA sequencing data |
title_fullStr | Detection and classification of peaks in 5' cap RNA sequencing data |
title_full_unstemmed | Detection and classification of peaks in 5' cap RNA sequencing data |
title_short | Detection and classification of peaks in 5' cap RNA sequencing data |
title_sort | detection and classification of peaks in 5' cap rna sequencing data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852351/ https://www.ncbi.nlm.nih.gov/pubmed/24564843 http://dx.doi.org/10.1186/1471-2164-14-S5-S9 |
work_keys_str_mv | AT strbenacdario detectionandclassificationofpeaksin5caprnasequencingdata AT armstrongnicolaj detectionandclassificationofpeaksin5caprnasequencingdata AT yangjeanyh detectionandclassificationofpeaksin5caprnasequencingdata |
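
The description field above mentions a sliding window with a Poisson null distribution for locating peaks. The following is a minimal sketch of that general idea, not the authors' implementation. The record does not give the window size, the rate estimate, or the significance threshold, so `half_width`, `background`, `alpha`, the choice to exclude the centre position from the rate estimate, and the function names `poisson_sf` and `detect_peaks` are all assumptions made for illustration.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for j in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / j
        cdf += term
    return max(0.0, 1.0 - cdf)


def detect_peaks(counts, half_width=5, alpha=1e-3, background=0.1):
    """Flag positions whose read count is unlikely under a local Poisson null.

    counts     -- 5' end read counts per genomic position (one strand)
    half_width -- hypothetical number of neighbouring positions on each side
    alpha      -- hypothetical p-value cut-off
    background -- hypothetical floor on the rate, so that a single read in an
                  otherwise empty window is not called a peak
    """
    peaks = []
    n = len(counts)
    for i, k in enumerate(counts):
        if k == 0:
            continue
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        # Assumption: the null rate is the mean count of the neighbours,
        # leaving out the position being tested.
        neighbours = counts[lo:i] + counts[i + 1:hi]
        if not neighbours:
            continue
        lam = max(sum(neighbours) / len(neighbours), background)
        p = poisson_sf(k, lam)
        if p < alpha:
            peaks.append((i, k, p))
    return peaks


if __name__ == "__main__":
    example = [0, 1, 0, 2, 1, 0, 30, 1, 0, 1, 0, 0, 1, 0]
    for position, count, p_value in detect_peaks(example):
        print(position, count, p_value)
```

Because each position is tested against its own neighbourhood only, one pass over the counts is enough, which is consistent with the record's remark that the windowed approach avoids whole genome segmentation. The second stage described in the record, a linear support vector machine that separates genuine initiation peaks from the rest, is not sketched here because the record does not list the features used.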