MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction

Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed t...

Bibliographic Details
Main authors: Zeng, Wenhuan, Gautam, Anupam, Huson, Daniel H
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367125/
https://www.ncbi.nlm.nih.gov/pubmed/37489753
http://dx.doi.org/10.1093/gigascience/giad054
_version_ 1785077320040579072
author Zeng, Wenhuan
Gautam, Anupam
Huson, Daniel H
collection PubMed
description Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
format Online
Article
Text
id pubmed-10367125
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10367125 2023-07-26 Gigascience Research Oxford University Press 2023-07-25 /pmc/articles/PMC10367125/ /pubmed/37489753 http://dx.doi.org/10.1093/gigascience/giad054 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10367125/
https://www.ncbi.nlm.nih.gov/pubmed/37489753
http://dx.doi.org/10.1093/gigascience/giad054