Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focu...
Main Authors: | Khanal, Jhabindra; Tayara, Hilal; Zou, Quan; Chong, Kil To |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Research Network of Computational and Structural Biotechnology 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042287/ https://www.ncbi.nlm.nih.gov/pubmed/33868598 http://dx.doi.org/10.1016/j.csbj.2021.03.015 |
_version_ | 1783678094427553792 |
---|---|
author | Khanal, Jhabindra; Tayara, Hilal; Zou, Quan; Chong, Kil To |
author_facet | Khanal, Jhabindra; Tayara, Hilal; Zou, Quan; Chong, Kil To |
author_sort | Khanal, Jhabindra |
collection | PubMed |
description | DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mers, which then served as features that were fed into a two-layer convolutional neural network (CNN) to classify whether the sample sequences contained 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/. |
format | Online Article Text |
id | pubmed-8042287 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Research Network of Computational and Structural Biotechnology |
record_format | MEDLINE/PubMed |
spelling | pubmed-8042287 2021-04-16 Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation Khanal, Jhabindra Tayara, Hilal Zou, Quan Chong, Kil To Comput Struct Biotechnol J Research Article DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mers, which then served as features that were fed into a two-layer convolutional neural network (CNN) to classify whether the sample sequences contained 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
Research Network of Computational and Structural Biotechnology 2021-03-19 /pmc/articles/PMC8042287/ /pubmed/33868598 http://dx.doi.org/10.1016/j.csbj.2021.03.015 Text en © 2021 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Research Article Khanal, Jhabindra Tayara, Hilal Zou, Quan Chong, Kil To Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation |
title | Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation |
title_full | Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation |
title_fullStr | Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation |
title_full_unstemmed | Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation |
title_short | Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation |
title_sort | identifying dna n4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042287/ https://www.ncbi.nlm.nih.gov/pubmed/33868598 http://dx.doi.org/10.1016/j.csbj.2021.03.015 |
work_keys_str_mv | AT khanaljhabindra identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation AT tayarahilal identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation AT zouquan identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation AT chongkilto identifyingdnan4methylcytosinesitesintherosaceaegenomewithadeeplearningmodelrelyingondistributedfeaturerepresentation |
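
The abstract describes treating k-mers as "biological words" before they are embedded with word2vec. As a rough illustration of that first step only, the sketch below splits a DNA sequence into overlapping k-mers and maps them to integer ids. The function names, the choice of k = 3, and the stride of 1 are assumptions made for this example; the record does not state the settings used by 4mC-w2vec, and the embedding and CNN stages are not shown.

```python
from typing import Dict, Iterable, List


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mer 'words' (overlapping when stride < k).

    k and stride are illustrative defaults, not values taken from the paper.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocabulary(sequences: Iterable[str], k: int = 3) -> Dict[str, int]:
    """Assign an integer id to each distinct k-mer, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for token in kmer_tokens(seq, k):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sequence: str, vocab: Dict[str, int], k: int = 3) -> List[int]:
    """Convert a sequence to k-mer ids, skipping k-mers absent from the vocabulary."""
    return [vocab[t] for t in kmer_tokens(sequence, k) if t in vocab]


if __name__ == "__main__":
    vocab = build_vocabulary(["ACGTAC"], k=3)
    print(kmer_tokens("ACGTAC", k=3))
    print(encode("CGTAC", vocab, k=3))
```

Token lists of this kind could then be passed to a word-embedding trainer so that each k-mer receives a dense vector, which is the form of input the abstract says was fed to the CNN.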