Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model

Bibliographic Details
Main Authors: Li, Kuan, Zhong, Yue, Lin, Xuan, Quan, Zhe
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780924/
https://www.ncbi.nlm.nih.gov/pubmed/33408741
http://dx.doi.org/10.3389/fgene.2020.605620
author Li, Kuan
Zhong, Yue
Lin, Xuan
Quan, Zhe
collection PubMed
description Accurately identifying missense mutations helps to mitigate the loss of protein function and structural changes, which might greatly reduce the risk of disease associated with tumor suppressor genes (e.g., BRCA1 and PTEN). In this paper, we propose a hybrid framework, called BertVS, that predicts the disease risk of missense mutations in proteins. Our framework learns sequence representations from the protein domain by pre-training BERT models, and also integrates the hydrophilic properties of amino acids to obtain sequence representations of biochemical characteristics. The concatenation of the two learned representations is then sent to the classifier to predict the missense mutations of protein sequences. Specifically, we use the protein family database (Pfam) as a corpus to train the BERT model to learn the contextual information of protein sequences, and our pre-trained BERT model achieves an accuracy of 0.984 on the masked language model prediction task. We conduct extensive experiments on BRCA1 and PTEN datasets. Compared with the baselines, BertVS achieves higher performance, with an AUROC of 0.920 and an AUPR of 0.915 in the functionally critical domain of the BRCA1 gene. Additionally, an extended experiment on the ClinVar dataset illustrates that gene variants with known clinical significance can also be efficiently classified by our method. Therefore, BertVS can learn the functional information of protein sequences and effectively predict the disease risk of variants of uncertain clinical significance.
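The description above outlines a pipeline in which two sequence representations are concatenated before classification, with results reported as AUROC. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: `placeholder_embedding` is a hypothetical stand-in for a pre-trained BERT embedding (it only returns amino-acid frequencies), the Kyte-Doolittle hydropathy index is an assumed choice of hydrophilicity scale that the record does not name, and the function names, window size, and toy sequence are all illustrative.

```python
# Kyte-Doolittle hydropathy index. ASSUMPTION: the record does not say which
# hydrophilicity scale BertVS uses; this one is chosen only for illustration.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KD)


def apply_missense(seq, position, new_residue):
    """Return seq with the residue at 1-based `position` replaced."""
    return seq[:position - 1] + new_residue + seq[position:]


def placeholder_embedding(seq):
    """HYPOTHETICAL stand-in for a pre-trained BERT sequence embedding.

    Returns the 20 amino-acid frequencies; a real model would return a
    learned contextual vector instead.
    """
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def hydropathy_features(seq, window=3):
    """Summarise sliding-window mean hydropathy as [mean, min, max]."""
    scores = [KD[a] for a in seq]
    means = [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
    return [sum(means) / len(means), min(means), max(means)]


def combined_representation(seq):
    """Concatenate the two representations into one classifier input."""
    return placeholder_embedding(seq) + hydropathy_features(seq)


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_wild_type = "MDLSALRVEE"  # toy sequence, for illustration only
toy_variant = apply_missense(toy_wild_type, 2, "P")
features = combined_representation(toy_variant)
```

In this sketch, `features` would be the input to whatever classifier is trained on labelled variants, and `auroc` would score that classifier's outputs on held-out labels.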
format Online
Article
Text
id pubmed-7780924
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7780924 2021-01-05 Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model Li, Kuan Zhong, Yue Lin, Xuan Quan, Zhe Front Genet Genetics Frontiers Media S.A. 2020-12-21 /pmc/articles/PMC7780924/ /pubmed/33408741 http://dx.doi.org/10.3389/fgene.2020.605620 Text en Copyright © 2020 Li, Zhong, Lin and Quan.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7780924/
https://www.ncbi.nlm.nih.gov/pubmed/33408741
http://dx.doi.org/10.3389/fgene.2020.605620