Deep learning and support vector machines for transcription start site identification

Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptional...

Bibliographic Details
Main Authors:	Barbero-Aparicio, José A., Olivares-Gil, Alicia, Díez-Pastor, José F., García-Osorio, César
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280436/ https://www.ncbi.nlm.nih.gov/pubmed/37346545 http://dx.doi.org/10.7717/peerj-cs.1340

_version_	1785060793765593088
author	Barbero-Aparicio, José A. Olivares-Gil, Alicia Díez-Pastor, José F. García-Osorio, César
author_facet	Barbero-Aparicio, José A. Olivares-Gil, Alicia Díez-Pastor, José F. García-Osorio, César
author_sort	Barbero-Aparicio, José A.
collection	PubMed
description	Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, with many of the most recent based on machine learning. Deep learning methods have proven exceptionally effective for these tasks, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor do they provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to working with sequences than SVMs. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also created a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
format	Online Article Text
id	pubmed-10280436
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10280436 2023-06-21 Deep learning and support vector machines for transcription start site identification Barbero-Aparicio, José A. Olivares-Gil, Alicia Díez-Pastor, José F. García-Osorio, César PeerJ Comput Sci Bioinformatics PeerJ Inc. 2023-04-17 /pmc/articles/PMC10280436/ /pubmed/37346545 http://dx.doi.org/10.7717/peerj-cs.1340 Text en ©2023 Barbero-Aparicio et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle	Bioinformatics Barbero-Aparicio, José A. Olivares-Gil, Alicia Díez-Pastor, José F. García-Osorio, César Deep learning and support vector machines for transcription start site identification
title	Deep learning and support vector machines for transcription start site identification
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280436/ https://www.ncbi.nlm.nih.gov/pubmed/37346545 http://dx.doi.org/10.7717/peerj-cs.1340
