
Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach

BACKGROUND: Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach...


Bibliographic Details
Main Authors: Qu, Jinchan, Steppi, Albert, Zhong, Dongrui, Hao, Jie, Wang, Jian, Lung, Pei-Yau, Zhao, Tingting, He, Zhe, Zhang, Jinfeng
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7654050/
https://www.ncbi.nlm.nih.gov/pubmed/33167858
http://dx.doi.org/10.1186/s12864-020-07185-7
author Qu, Jinchan
Steppi, Albert
Zhong, Dongrui
Hao, Jie
Wang, Jian
Lung, Pei-Yau
Zhao, Tingting
He, Zhe
Zhang, Jinfeng
collection PubMed
description BACKGROUND: Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from the literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS: Our system makes use of the latest NLP methods with a large number of engineered features, including some based on pre-trained word embeddings. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track, with the highest recall and a comparable F1-score. CONCLUSIONS: The performance of our method indicates that it is ideally suited for being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.
format Online
Article
Text
id pubmed-7654050
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7654050 2020-11-10 Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach Qu, Jinchan Steppi, Albert Zhong, Dongrui Hao, Jie Wang, Jian Lung, Pei-Yau Zhao, Tingting He, Zhe Zhang, Jinfeng BMC Genomics Methodology Article
BioMed Central 2020-11-10 /pmc/articles/PMC7654050/ /pubmed/33167858 http://dx.doi.org/10.1186/s12864-020-07185-7 Text en © The Author(s) 2020. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach
topic Methodology Article