
Detection and classification of peaks in 5' cap RNA sequencing data

BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identif...


Bibliographic Details
Main Authors: Strbenac, Dario; Armstrong, Nicola J; Yang, Jean YH
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852351/
https://www.ncbi.nlm.nih.gov/pubmed/24564843
http://dx.doi.org/10.1186/1471-2164-14-S5-S9
author Strbenac, Dario
Armstrong, Nicola J
Yang, Jean YH
collection PubMed
description BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations.
RESULTS: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with Poisson null distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks which represent the initiation of transcription and peaks that do not. Comparison of classification performance to the best existing method based on whole genome segmentation showed comparable precision and improved recall. Internal features, which are intrinsic to the data and require no further experiments, had high precision and recall rates. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision.
CONCLUSIONS: The Poisson sliding window model is an effective and fast way of taking the peak neighbourhood into account, and finding statistically significant peaks over a range of transcript expression values. It is orders of magnitude faster than doing whole genome segmentation. The support vector classification scheme has better precision and recall than existing methods. Integrating additional datasets is shown to provide minor gains in recall, in comparison to using only the cap-sequencing data.
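The first stage described in the abstract, a sliding window with a Poisson null distribution, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window half-width, the significance threshold `alpha`, and the background floor `min_rate` are all hypothetical choices made for illustration. The sketch assumes the input is a list of 5' read-start counts per genomic position, estimates the local background rate from the neighbouring positions, and tests each position's count against a Poisson distribution with that rate.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), via 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(counts, half_width=10, alpha=1e-3, min_rate=0.1):
    """Flag positions whose count is unlikely under the local background.

    counts     -- read-start counts per position (assumed input layout)
    half_width -- positions on each side used as the neighbourhood (assumed)
    alpha      -- p-value cut-off (assumed; no multiple-testing correction)
    min_rate   -- floor on the background rate so that a single read in an
                  empty neighbourhood is not called significant (assumed)
    """
    peaks = []
    n = len(counts)
    for i, k in enumerate(counts):
        if k == 0:
            continue
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        neighbours = hi - lo - 1
        if neighbours == 0:
            continue
        # Background rate: mean count of the window, excluding position i.
        lam = max((sum(counts[lo:hi]) - k) / neighbours, min_rate)
        p = poisson_upper_tail(k, lam)
        if p < alpha:
            peaks.append((i, k, p))
    return peaks


if __name__ == "__main__":
    demo = [0] * 50
    demo[10] = 1   # isolated single read: expected to be ignored
    demo[25] = 30  # strong pile-up of read starts
    for pos, count, p in call_peaks(demo):
        print(pos, count, p)
```

Each window touches only a fixed number of positions, so the scan is linear in genome length, which is in keeping with the speed advantage over whole genome segmentation that the abstract reports. The second stage, the linear support vector machine classifier, is not shown here; it would take features of each called peak as input.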
format Online
Article
Text
id pubmed-3852351
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3852351 2013-12-19 Detection and classification of peaks in 5' cap RNA sequencing data Strbenac, Dario Armstrong, Nicola J Yang, Jean YH BMC Genomics Research BioMed Central 2013-10-16 /pmc/articles/PMC3852351/ /pubmed/24564843 http://dx.doi.org/10.1186/1471-2164-14-S5-S9 Text en Copyright © 2013 Strbenac et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Detection and classification of peaks in 5' cap RNA sequencing data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852351/
https://www.ncbi.nlm.nih.gov/pubmed/24564843
http://dx.doi.org/10.1186/1471-2164-14-S5-S9