Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation

DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focu...


Bibliographic Details
Main Authors: Khanal, Jhabindra; Tayara, Hilal; Zou, Quan; Chong, Kil To
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042287/
https://www.ncbi.nlm.nih.gov/pubmed/33868598
http://dx.doi.org/10.1016/j.csbj.2021.03.015
_version_ 1783678094427553792
author Khanal, Jhabindra
Tayara, Hilal
Zou, Quan
Chong, Kil To
author_facet Khanal, Jhabindra
Tayara, Hilal
Zou, Quan
Chong, Kil To
author_sort Khanal, Jhabindra
collection PubMed
description DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research would therefore benefit from a computational approach that relies on automatic feature selection to identify relevant sites. Here we report 4mC-w2vec, a computational method that learns discriminative features automatically in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system encodes 4mC and non-4mC sequences through a word embedding process that includes sub-word information, with k-mers serving as the biological words. The resulting features are fed into a two-layer convolutional neural network (CNN) that classifies whether a sample sequence contains a 4mC or a non-4mC site. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and a web server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
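
The k-mer "biological word" idea mentioned in the description can be illustrated with a minimal sketch. This is not the authors' code: the function names `kmer_tokenize` and `build_vocab`, the choice of k = 3, the stride of 1, and the toy sequences are all assumptions made for illustration. It only shows how a DNA sequence might be split into overlapping k-mers and mapped to integer indices, the usual input to a word embedding step such as word2vec.

```python
from __future__ import annotations

from collections import Counter


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers ("words").

    k and stride are illustrative defaults, not values taken from the paper.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(corpus: list[list[str]]) -> dict[str, int]:
    """Map each distinct k-mer to an integer index, most frequent first."""
    counts = Counter(word for sentence in corpus for word in sentence)
    # Sort by descending count, then alphabetically, so the order is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered)}


# Hypothetical toy sequences; real samples would be windows around candidate sites.
sequences = ["ACGTCA", "CGTCAG"]
corpus = [kmer_tokenize(s, k=3) for s in sequences]
vocab = build_vocab(corpus)
# Each sequence becomes a list of word indices that an embedding layer could look up.
encoded = [[vocab[w] for w in sentence] for sentence in corpus]
```

In a full pipeline, each tokenized sequence would be treated as a sentence for embedding training, and the embedded sequence would then be passed to the classifier.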
format Online
Article
Text
id pubmed-8042287
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-8042287 2021-04-16 Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation Khanal, Jhabindra Tayara, Hilal Zou, Quan Chong, Kil To Comput Struct Biotechnol J Research Article
Research Network of Computational and Structural Biotechnology 2021-03-19 /pmc/articles/PMC8042287/ /pubmed/33868598 http://dx.doi.org/10.1016/j.csbj.2021.03.015 Text en © 2021 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Research Article
Khanal, Jhabindra
Tayara, Hilal
Zou, Quan
Chong, Kil To
Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
title Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
title_full Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
title_fullStr Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
title_full_unstemmed Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
title_short Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
title_sort identifying dna n4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042287/
https://www.ncbi.nlm.nih.gov/pubmed/33868598
http://dx.doi.org/10.1016/j.csbj.2021.03.015
work_keys_str_mv AT khanaljhabindra identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation
AT tayarahilal identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation
AT zouquan identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation
AT chongkilto identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation