iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features
BACKGROUND: Promoters, non-coding DNA sequences located at upstream regions of the transcription start site of genes/gene clusters, are essential regulatory elements for the initiation and regulation of transcriptional processes. Furthermore, identifying promoters in DNA sequences and genomes signif...
Main authors: | Nguyen-Vo, Thanh-Hoang; Trinh, Quang H.; Nguyen, Loc; Nguyen-Hoang, Phuong-Uyen; Rahardja, Susanto; Nguyen, Binh P. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9531353/ https://www.ncbi.nlm.nih.gov/pubmed/36192696 http://dx.doi.org/10.1186/s12864-022-08829-6 |
_version_ | 1784801884173762560 |
---|---|
author | Nguyen-Vo, Thanh-Hoang Trinh, Quang H. Nguyen, Loc Nguyen-Hoang, Phuong-Uyen Rahardja, Susanto Nguyen, Binh P. |
author_sort | Nguyen-Vo, Thanh-Hoang |
collection | PubMed |
description | BACKGROUND: Promoters, non-coding DNA sequences located at upstream regions of the transcription start site of genes/gene clusters, are essential regulatory elements for the initiation and regulation of transcriptional processes. Furthermore, identifying promoters in DNA sequences and genomes significantly contributes to discovering entire structures of genes of interest. Therefore, exploration of promoter regions is one of the most imperative topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec, an efficient computational model to predict TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter Database and then refined to create four benchmark datasets. RESULTS: The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power compared to those predicting non-TATA promoters. With a novel idea of constructing artificial non-promoter sequences based on promoter sequences, our models were able to learn highly specific characteristics discriminating promoters from non-promoters to improve predictive efficiency. CONCLUSIONS: iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source code and web server are available at https://github.com/mldlproject/2022-iPromoter-Seqvec. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-022-08829-6. |
format | Online Article Text |
id | pubmed-9531353 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-9531353 2022-10-05 iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features Nguyen-Vo, Thanh-Hoang Trinh, Quang H. Nguyen, Loc Nguyen-Hoang, Phuong-Uyen Rahardja, Susanto Nguyen, Binh P. BMC Genomics Research BioMed Central 2022-10-03 /pmc/articles/PMC9531353/ /pubmed/36192696 http://dx.doi.org/10.1186/s12864-022-08829-6 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9531353/ https://www.ncbi.nlm.nih.gov/pubmed/36192696 http://dx.doi.org/10.1186/s12864-022-08829-6 |
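
The description above names AUCROC and AUCPR as the two key evaluation metrics. As a rough illustration of what those numbers measure, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the function names `auc_roc` and `auc_pr` and the toy labels and scores are hypothetical, AUCPR is approximated by average precision (one common convention among several), and the scores are assumed to contain no ties between positives and negatives for the precision-recall part.

```python
def auc_roc(labels, scores):
    """AUCROC as the probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_pr(labels, scores):
    """AUCPR approximated by average precision: the mean of the precision
    values measured at the rank of each positive example."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


# Hypothetical toy data: 1 = promoter, 0 = non-promoter.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

if __name__ == "__main__":
    print("AUCROC:", auc_roc(toy_labels, toy_scores))
    print("AUCPR (average precision):", auc_pr(toy_labels, toy_scores))
```

Both values reach 1.0 when every promoter is scored above every non-promoter. AUCPR is the more sensitive of the two when positives are rare, which is one reason the two metrics are often reported together.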