
Accurate prediction of DNA N(4)-methylcytosine sites via boost-learning various types of sequence features

BACKGROUND: DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explor...


Bibliographic Details
Main Authors: Zhao, Zhixun, Zhang, Xiaocai, Chen, Fang, Fang, Liang, Li, Jinyan
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7488740/
https://www.ncbi.nlm.nih.gov/pubmed/32917152
http://dx.doi.org/10.1186/s12864-020-07033-8
author Zhao, Zhixun
Zhang, Xiaocai
Chen, Fang
Fang, Liang
Li, Jinyan
collection PubMed
description BACKGROUND: DNA N4-methylcytosine (4mC) is a critical epigenetic modification with various roles in the restriction-modification system. Because experimental laboratory detection is costly, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because they lack effective sequence features and choose learning algorithms ad hoc. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. RESULTS: The feature importance score distributions in datasets of six species are first reported and analyzed. The impact of feature selection on model performance is then evaluated by independent testing on benchmark datasets, where ACC and MCC after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using an independent test and 10-fold cross-validation, and it outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies confirmed the excellent overall performance of the proposed method, which correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene. CONCLUSIONS: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies demonstrate the effectiveness of the method in practical situations.
format Online
Article
Text
id pubmed-7488740
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7488740 2020-09-16 Accurate prediction of DNA N(4)-methylcytosine sites via boost-learning various types of sequence features Zhao, Zhixun Zhang, Xiaocai Chen, Fang Fang, Liang Li, Jinyan BMC Genomics Research Article BioMed Central 2020-09-11 /pmc/articles/PMC7488740/ /pubmed/32917152 http://dx.doi.org/10.1186/s12864-020-07033-8 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Accurate prediction of DNA N(4)-methylcytosine sites via boost-learning various types of sequence features
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7488740/
https://www.ncbi.nlm.nih.gov/pubmed/32917152
http://dx.doi.org/10.1186/s12864-020-07033-8