
Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network

The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic inter...

Bibliographic Details
Main Authors:	David, Rakesh; Menezes, Rhys-Joshua D.; De Klerk, Jan; Castleden, Ian R.; Hooper, Cornelia M.; Carneiro, Gustavo; Gilliham, Matthew
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7813825/ https://www.ncbi.nlm.nih.gov/pubmed/33462256 http://dx.doi.org/10.1038/s41598-020-80441-8

_version_	1783637936543105024
author	David, Rakesh Menezes, Rhys-Joshua D. De Klerk, Jan Castleden, Ian R. Hooper, Cornelia M. Carneiro, Gustavo Gilliham, Matthew
author_facet	David, Rakesh Menezes, Rhys-Joshua D. De Klerk, Jan Castleden, Ian R. Hooper, Cornelia M. Carneiro, Gustavo Gilliham, Matthew
author_sort	David, Rakesh
collection	PubMed
description	The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4% respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
format	Online Article Text
id	pubmed-7813825
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-7813825 2021-01-21 Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network David, Rakesh Menezes, Rhys-Joshua D. De Klerk, Jan Castleden, Ian R. Hooper, Cornelia M. Carneiro, Gustavo Gilliham, Matthew Sci Rep Article The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4% respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses. 
Nature Publishing Group UK 2021-01-18 /pmc/articles/PMC7813825/ /pubmed/33462256 http://dx.doi.org/10.1038/s41598-020-80441-8 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle	Article David, Rakesh Menezes, Rhys-Joshua D. De Klerk, Jan Castleden, Ian R. Hooper, Cornelia M. Carneiro, Gustavo Gilliham, Matthew Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
title	Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
title_full	Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
title_fullStr	Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
title_full_unstemmed	Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
title_short	Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
title_sort	identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7813825/ https://www.ncbi.nlm.nih.gov/pubmed/33462256 http://dx.doi.org/10.1038/s41598-020-80441-8
