
A Contrastive Learning Pre-Training Method for Motif Occupancy Identification


Bibliographic Details
Main Authors: Lin, Ken, Quan, Xiongwen, Yin, Wenya, Zhang, Han
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9103107/
https://www.ncbi.nlm.nih.gov/pubmed/35563090
http://dx.doi.org/10.3390/ijms23094699
author Lin, Ken
Quan, Xiongwen
Yin, Wenya
Zhang, Han
author_facet Lin, Ken
Quan, Xiongwen
Yin, Wenya
Zhang, Han
author_sort Lin, Ken
collection PubMed
description Motif occupancy identification is a binary classification task that predicts whether DNA motif instances are bound by transcription factors, and several sequence-based methods have been proposed for it. However, because they are trained directly end to end, these methods lack biological interpretability in their sequence representations. In this work, we propose a contrastive learning method to pre-train interpretable and robust DNA encodings for motif occupancy identification. We construct two alternative models to pre-train the DNA sequence encoder: a self-supervised model and a supervised model. We augment the original sequences for contrastive learning with the edit operations defined in edit distance. Specifically, we propose a sequence similarity criterion based on the Needleman–Wunsch algorithm to discriminate positive and negative sample pairs in self-supervised learning. Finally, a DNN classifier is fine-tuned along with the pre-trained encoder to predict motif occupancy. Both proposed contrastive learning models outperform the baseline end-to-end CNN model and the SimCLR method, reaching AUCs of 0.811 and 0.823, respectively. Compared with the baseline method, our models are more robust on small samples. In particular, the self-supervised model proves practicable for transfer learning. (See the illustrative sketch following this record.)
format Online
Article
Text
id pubmed-9103107
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9103107 2022-05-14 A Contrastive Learning Pre-Training Method for Motif Occupancy Identification Lin, Ken Quan, Xiongwen Yin, Wenya Zhang, Han Int J Mol Sci Article Motif occupancy identification is a binary classification task that predicts whether DNA motif instances are bound by transcription factors, and several sequence-based methods have been proposed for it. However, because they are trained directly end to end, these methods lack biological interpretability in their sequence representations. In this work, we propose a contrastive learning method to pre-train interpretable and robust DNA encodings for motif occupancy identification. We construct two alternative models to pre-train the DNA sequence encoder: a self-supervised model and a supervised model. We augment the original sequences for contrastive learning with the edit operations defined in edit distance. Specifically, we propose a sequence similarity criterion based on the Needleman–Wunsch algorithm to discriminate positive and negative sample pairs in self-supervised learning. Finally, a DNN classifier is fine-tuned along with the pre-trained encoder to predict motif occupancy. Both proposed contrastive learning models outperform the baseline end-to-end CNN model and the SimCLR method, reaching AUCs of 0.811 and 0.823, respectively. Compared with the baseline method, our models are more robust on small samples. In particular, the self-supervised model proves practicable for transfer learning. MDPI 2022-04-24 /pmc/articles/PMC9103107/ /pubmed/35563090 http://dx.doi.org/10.3390/ijms23094699 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Lin, Ken
Quan, Xiongwen
Yin, Wenya
Zhang, Han
A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
title A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
title_full A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
title_fullStr A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
title_full_unstemmed A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
title_short A Contrastive Learning Pre-Training Method for Motif Occupancy Identification
title_sort contrastive learning pre-training method for motif occupancy identification
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9103107/
https://www.ncbi.nlm.nih.gov/pubmed/35563090
http://dx.doi.org/10.3390/ijms23094699
work_keys_str_mv AT linken acontrastivelearningpretrainingmethodformotifoccupancyidentification
AT quanxiongwen acontrastivelearningpretrainingmethodformotifoccupancyidentification
AT yinwenya acontrastivelearningpretrainingmethodformotifoccupancyidentification
AT zhanghan acontrastivelearningpretrainingmethodformotifoccupancyidentification
AT linken contrastivelearningpretrainingmethodformotifoccupancyidentification
AT quanxiongwen contrastivelearningpretrainingmethodformotifoccupancyidentification
AT yinwenya contrastivelearningpretrainingmethodformotifoccupancyidentification
AT zhanghan contrastivelearningpretrainingmethodformotifoccupancyidentification
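
The description field above names two concrete ingredients of the method: augmenting DNA sequences with the edit operations that define edit distance, and a Needleman–Wunsch similarity criterion for separating positive from negative sample pairs during self-supervised pre-training. The following is a minimal Python sketch of both ideas, not the authors' implementation; the scoring scheme (match 1, mismatch -1, gap -1), the number of edits per augmentation, and the 0.5 normalized-score threshold are illustrative assumptions rather than values taken from the paper.

import random

BASES = "ACGT"

def augment(seq, n_edits=2):
    """Apply n_edits random edit operations (substitution, insertion,
    deletion), the operations that define edit distance, to produce an
    augmented view of a DNA sequence."""
    s = list(seq)
    for _ in range(n_edits):
        op = random.choice(("sub", "ins", "del"))
        i = random.randrange(len(s))
        if op == "sub":
            s[i] = random.choice(BASES)
        elif op == "ins":
            s.insert(i, random.choice(BASES))
        elif len(s) > 1:  # deletion; always keep at least one base
            del s[i]
    return "".join(s)

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the standard dynamic-programming
    recurrence of the Needleman-Wunsch algorithm."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

def is_positive_pair(a, b, threshold=0.5):
    """Hypothetical criterion: call a pair positive when its
    length-normalized alignment score clears the threshold."""
    return needleman_wunsch(a, b) / max(len(a), len(b)) >= threshold

if __name__ == "__main__":
    anchor = "ACGTACGTGGTACA"
    view = augment(anchor)                  # lightly edited copy: likely positive
    other = augment("TTTTGGGGCCCCAAAA", 5)  # unrelated sequence: likely negative
    print(view, is_positive_pair(anchor, view))
    print(other, is_positive_pair(anchor, other))

In the paper's setting, pairs labeled this way would feed a contrastive loss over the encoder's representations before the DNN classifier is fine-tuned; that training loop is beyond the scope of this sketch.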