MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed t...
Main authors: | Zeng, Wenhuan; Gautam, Anupam; Huson, Daniel H |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367125/ https://www.ncbi.nlm.nih.gov/pubmed/37489753 http://dx.doi.org/10.1093/gigascience/giad054 |
_version_ | 1785077320040579072 |
---|---|
author | Zeng, Wenhuan Gautam, Anupam Huson, Daniel H |
author_facet | Zeng, Wenhuan Gautam, Anupam Huson, Daniel H |
author_sort | Zeng, Wenhuan |
collection | PubMed |
description | Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach. |
format | Online Article Text |
id | pubmed-10367125 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-103671252023-07-26 MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction Zeng, Wenhuan Gautam, Anupam Huson, Daniel H Gigascience Research Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source, and we provide a web server that implements the approach. Oxford University Press 2023-07-25 /pmc/articles/PMC10367125/ /pubmed/37489753 http://dx.doi.org/10.1093/gigascience/giad054 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. 
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Zeng, Wenhuan Gautam, Anupam Huson, Daniel H MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction |
title | MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction |
title_full | MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction |
title_fullStr | MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction |
title_full_unstemmed | MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction |
title_short | MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction |
title_sort | mulan-methyl—multiple transformer-based language models for accurate dna methylation prediction |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367125/ https://www.ncbi.nlm.nih.gov/pubmed/37489753 http://dx.doi.org/10.1093/gigascience/giad054 |
work_keys_str_mv | AT zengwenhuan mulanmethylmultipletransformerbasedlanguagemodelsforaccuratednamethylationprediction AT gautamanupam mulanmethylmultipletransformerbasedlanguagemodelsforaccuratednamethylationprediction AT husondanielh mulanmethylmultipletransformerbasedlanguagemodelsforaccuratednamethylationprediction |
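
The description field states that the 5 fine-tuned models "are used to collectively predict the DNA methylation status" but this record does not say how their outputs are combined. The sketch below illustrates one common way to combine several classifiers, averaging their per-site probabilities and thresholding the mean. The averaging rule, the function name `ensemble_predict`, the `threshold` parameter, and the example probabilities are all assumptions made for illustration; they are not taken from the MuLan-Methyl paper or its code.

```python
from statistics import fmean


def ensemble_predict(per_model_probs, threshold=0.5):
    """Combine per-site methylation probabilities from several models.

    per_model_probs: one list per model, each holding that model's
    predicted probability of methylation for the same ordered sites.
    Returns (mean_probs, labels), where labels[i] is 1 when the mean
    probability for site i reaches the threshold, else 0.

    NOTE: simple averaging is an assumed combination rule used only
    for illustration; the catalog record does not specify the rule.
    """
    if not per_model_probs:
        raise ValueError("need predictions from at least one model")
    n_sites = len(per_model_probs[0])
    if any(len(probs) != n_sites for probs in per_model_probs):
        raise ValueError("every model must score the same number of sites")

    # zip(*...) regroups the values so each tuple holds all models'
    # probabilities for a single site.
    mean_probs = [fmean(site_probs) for site_probs in zip(*per_model_probs)]
    labels = [int(p >= threshold) for p in mean_probs]
    return mean_probs, labels


if __name__ == "__main__":
    # Hypothetical probabilities from 5 models for 2 candidate sites.
    example = [
        [0.9, 0.2],
        [0.8, 0.4],
        [0.7, 0.1],
        [0.6, 0.3],
        [1.0, 0.0],
    ]
    means, calls = ensemble_predict(example)
    print(means, calls)
```

Averaging keeps the example short and treats every model equally; a weighted vote or a learned meta-classifier would be equally plausible alternatives under the same description.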