Detection and classification of peaks in 5' cap RNA sequencing data
Main authors: | , , |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852351/ https://www.ncbi.nlm.nih.gov/pubmed/24564843 http://dx.doi.org/10.1186/1471-2164-14-S5-S9 |
Summary: | BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations. RESULTS: We develop a two-stage approach consisting of a fast and simple algorithm based on a sliding window with a Poisson null distribution for detecting the genomic locations of peaks, followed by a linear support vector machine classifier to distinguish between peaks that represent the initiation of transcription and peaks that do not. Comparison of classification performance to the best existing method based on whole genome segmentation showed comparable precision and improved recall. Internal features, which are intrinsic to the data and require no further experiments, had high precision and recall rates. Addition of pooled external data or matched RNA sequencing data resulted in gains of recall with equivalent precision. CONCLUSIONS: The Poisson sliding window model is an effective and fast way of taking the peak neighbourhood into account, and finding statistically significant peaks over a range of transcript expression values. It is orders of magnitude faster than doing whole genome segmentation. The support vector classification scheme has better precision and recall than existing methods. Integrating additional datasets is shown to provide minor gains in recall, in comparison to using only the cap-sequencing data. |
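The first stage described in the summary, a sliding window with a Poisson null distribution, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `poisson_sf` and `call_peaks`, the window half-width, the significance threshold, and the use of the global mean as a floor on the local rate are all assumptions made for illustration only.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Accumulate P(X = i) for i = 0 .. k-1, then take the complement.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(counts, half_window=50, alpha=1e-3):
    """Return positions whose 5' read count is significant under a local Poisson null.

    counts      -- list of read counts, one per genomic position (hypothetical input)
    half_window -- positions considered on each side (assumed value)
    alpha       -- p-value threshold (assumed value, no multiple-testing correction)
    """
    n = len(counts)
    if n == 0:
        return []
    # Prefix sums give each window total in constant time.
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    global_rate = prefix[-1] / n

    peaks = []
    for pos, c in enumerate(counts):
        if c == 0:
            continue
        lo = max(0, pos - half_window)
        hi = min(n, pos + half_window + 1)
        neighbours = hi - lo - 1
        # Mean count per neighbouring position, excluding the tested position.
        local_rate = (prefix[hi] - prefix[lo] - c) / neighbours if neighbours else 0.0
        # Assumption: floor the rate at the global mean so that an isolated
        # read in an empty neighbourhood is not automatically significant.
        lam = max(local_rate, global_rate)
        if poisson_sf(c, lam) < alpha:
            peaks.append(pos)
    return peaks


if __name__ == "__main__":
    counts = [0] * 200
    counts[100] = 30  # a strong candidate start site
    counts[20] = 1    # isolated background reads
    counts[150] = 1
    print(call_peaks(counts))
```

Positions that pass this test would then go to the second stage, the linear support vector machine classifier, which the sketch does not cover.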