DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data

BACKGROUND: The widespread usage of Cap Analysis of Gene Expression (CAGE) has led to numerous breakthroughs in understanding the transcription mechanisms. Recent evidence in the literature, however, suggests that CAGE suffers from transcriptional and technical noise. Regardless of the sample qualit...

Bibliographic Details
Main Authors:	Grigoriadis, Dimitris; Perdikopanis, Nikos; Georgakilas, Georgios K.; Hatzigeorgiou, Artemis G.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Methodology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9743497/ https://www.ncbi.nlm.nih.gov/pubmed/36510136 http://dx.doi.org/10.1186/s12859-022-04945-y

_version_	1784848735248842752
author	Grigoriadis, Dimitris; Perdikopanis, Nikos; Georgakilas, Georgios K.; Hatzigeorgiou, Artemis G.
author_sort	Grigoriadis, Dimitris
collection	PubMed
description	BACKGROUND: The widespread usage of Cap Analysis of Gene Expression (CAGE) has led to numerous breakthroughs in understanding the transcription mechanisms. Recent evidence in the literature, however, suggests that CAGE suffers from transcriptional and technical noise. Regardless of the sample quality, there is a significant number of CAGE peaks that are not associated with transcription initiation events. This type of signal is typically attributed to technical noise and, more frequently, to random five-prime capping or transcription bioproducts. Thus, there is a need for computational methods that can accurately increase the signal-to-noise ratio in CAGE data, resulting in error-free transcription start site (TSS) annotation and quantification of regulatory region usage. In this study, we present DeepTSS, a novel computational method for processing CAGE samples that combines genomic signal processing (GSP), structural DNA features, evolutionary conservation evidence and raw DNA sequence with Deep Learning (DL) to provide single-nucleotide TSS predictions with unprecedented levels of performance. RESULTS: To evaluate DeepTSS, we utilized experimental data, protein-coding gene annotations and computationally derived genome segmentations by chromatin states. DeepTSS was found to outperform existing algorithms on all benchmarks, achieving 98% precision and 96% sensitivity (accuracy 95.4%) on the protein-coding gene strategy, with 96.66% of its positive predictions overlapping active chromatin, and 98.27% and 92.04% co-localized with at least one transcription factor peak and at least one H3K4me3 peak, respectively. CONCLUSIONS: CAGE is a key protocol in deciphering the language of transcription; however, like every experimental protocol, it suffers from biological and technical noise that can severely affect downstream analyses. DeepTSS is a novel DL-based method for effectively removing noisy CAGE signal. 
In contrast to existing software, DeepTSS does not require feature selection, since the embedded convolutional layers can readily identify patterns and utilize only the important ones for the classification task. This study highlights the key role that DL can play in molecular biology by removing the inherent flaws of the experimental protocols that form the backbone of contemporary research. Here, we show how DeepTSS can unleash the full potential of an already popular and mature method such as CAGE, and push the boundaries of research on coding and non-coding gene expression regulators even further. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04945-y.
format	Online Article Text
id	pubmed-9743497
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-9743497 2022-12-13 BMC Bioinformatics Methodology BioMed Central 2022-12-12 /pmc/articles/PMC9743497/ /pubmed/36510136 http://dx.doi.org/10.1186/s12859-022-04945-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9743497/ https://www.ncbi.nlm.nih.gov/pubmed/36510136 http://dx.doi.org/10.1186/s12859-022-04945-y
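
The description field above reports precision, sensitivity and accuracy for DeepTSS. As a reminder of how these three figures are conventionally derived from a binary confusion matrix, here is a minimal Python sketch. The `ConfusionCounts` class, the function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper or from the DeepTSS software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary classifier (hypothetical container, not from DeepTSS)."""
    tp: int  # true positives: real TSSs predicted as TSSs
    fp: int  # false positives: noise peaks predicted as TSSs
    fn: int  # false negatives: real TSSs predicted as noise
    tn: int  # true negatives: noise peaks predicted as noise


def precision(c: ConfusionCounts) -> float:
    # Fraction of positive predictions that are correct.
    return c.tp / (c.tp + c.fp)


def sensitivity(c: ConfusionCounts) -> float:
    # Fraction of real positives that are recovered (also called recall).
    return c.tp / (c.tp + c.fn)


def accuracy(c: ConfusionCounts) -> float:
    # Fraction of all predictions, positive and negative, that are correct.
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


# Illustrative counts only; these are NOT results reported in the article.
example = ConfusionCounts(tp=90, fp=10, fn=30, tn=70)
```

Because accuracy also depends on the true negatives, it can be lower than both precision and sensitivity, which is consistent with the pattern of the three figures quoted in the description.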
