A Novel Position-Specific Encoding Algorithm (SeqPose) of Nucleotide Sequences and Its Application for Detecting Enhancers

Bibliographic Details
Main Authors: Mu, Xuechen, Wang, Yueying, Duan, Meiyu, Liu, Shuai, Li, Fei, Wang, Xiuli, Zhang, Kai, Huang, Lan, Zhou, Fengfeng
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8002641/
https://www.ncbi.nlm.nih.gov/pubmed/33802922
http://dx.doi.org/10.3390/ijms22063079
author Mu, Xuechen
Wang, Yueying
Duan, Meiyu
Liu, Shuai
Li, Fei
Wang, Xiuli
Zhang, Kai
Huang, Lan
Zhou, Fengfeng
collection PubMed
description Enhancers are short genomic regions exerting tissue-specific regulatory roles, usually for remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detection facilitates a better understanding of the transcriptional regulation mechanism. The accurate detection of enhancers and the evaluation of their transcriptional regulation strength remain a major bioinformatics challenge. Most current studies utilize the statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in a two-layer bi-directional long short-term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the delivered classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of the detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. Existing studies may focus on selecting the statistical DNA sequence descriptors with large contributions to the prediction models; this study does not utilize these statistical DNA sequence descriptors. The word vector of the SeqPose-encoded features is then obtained through a word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. A previous study generously provided the training and independent test datasets, and the proposed spEnhancer is compared with three existing state-of-the-art studies using the same experimental procedure. The leave-one-out validation data on the training dataset show that the proposed spEnhancer achieves detection performance similar to that of the three existing studies. On the independent test dataset, however, spEnhancer achieves the best overall performance metric (MCC) for both of the two binary classification problems. The experimental data show that the strategy of removing redundant positions (SeqPose) may help improve DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for novel query enhancers that are not included in the training dataset.
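As a rough illustration of the ideas named in the description, the sketch below tags each overlapping k-mer with its start position, drops a set of filtered positions, and maps the remaining tokens to integer ids that a word embedding layer could consume. It also computes the Matthews correlation coefficient (MCC) from confusion-matrix counts. The abstract does not state the value of k, the token format, or which 45 positions were removed, so the function names, the `KMER@position` format, `k=3`, and the dropped positions here are all hypothetical choices and not the authors' implementation.

```python
from math import sqrt


def position_tagged_kmers(seq, k=3, drop_positions=frozenset()):
    """Return overlapping k-mers tagged with their start position.

    The 'KMER@position' token format is an assumption for illustration.
    Positions listed in drop_positions are skipped, standing in for
    positions filtered out by a feature-selection step.
    """
    seq = seq.upper()
    return [
        f"{seq[i:i + k]}@{i}"
        for i in range(len(seq) - k + 1)
        if i not in drop_positions
    ]


def build_vocab(token_lists):
    """Map each distinct token to an integer id (0 is left for padding)."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical toy input: a 6-nt sequence with position 1 filtered out.
tokens = position_tagged_kmers("ACGTAC", k=3, drop_positions={1})
vocab = build_vocab([tokens])
ids = [vocab[t] for t in tokens]
print(tokens)  # ['ACG@0', 'GTA@2', 'TAC@3']
print(ids)     # [1, 2, 3]
```

Because the position is part of the token, the same k-mer at two different offsets receives two different ids, which is one simple way to carry location information into an embedding layer.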
format Online
Article
Text
id pubmed-8002641
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8002641 2021-03-28 A Novel Position-Specific Encoding Algorithm (SeqPose) of Nucleotide Sequences and Its Application for Detecting Enhancers Mu, Xuechen Wang, Yueying Duan, Meiyu Liu, Shuai Li, Fei Wang, Xiuli Zhang, Kai Huang, Lan Zhou, Fengfeng Int J Mol Sci Article MDPI 2021-03-17 /pmc/articles/PMC8002641/ /pubmed/33802922 http://dx.doi.org/10.3390/ijms22063079 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title A Novel Position-Specific Encoding Algorithm (SeqPose) of Nucleotide Sequences and Its Application for Detecting Enhancers
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8002641/
https://www.ncbi.nlm.nih.gov/pubmed/33802922
http://dx.doi.org/10.3390/ijms22063079