
Automatic query generation using word embeddings for retrieving passages describing experimental methods

Information regarding the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central for many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. A lot of...


Bibliographic Details
Main Authors: Aydın, Ferhat, Hüsünbeyi, Zehra Melce, Özgür, Arzucan
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225401/
https://www.ncbi.nlm.nih.gov/pubmed/28077568
http://dx.doi.org/10.1093/database/baw166
author Aydın, Ferhat
Hüsünbeyi, Zehra Melce
Özgür, Arzucan
collection PubMed
description Information regarding the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central for many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. A lot of information about PPIs and the experimental methods are only available in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages with experimental methods for physical interactions between proteins as an information retrieval search task. The baseline system is based on query matching, where the queries are generated by utilizing the names (including synonyms) of the experimental methods in the Proteomics Standard Initiative–Molecular Interactions (PSI-MI) ontology. We propose two methods, where the baseline queries are expanded by including additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by using the term frequency–relevance frequency (tf.rf) metric over 13 articles from our manually annotated data set of 30 full text articles, which is made publicly available. On the other hand, the second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained by utilizing a large unlabeled full text corpus. The proposed methods are evaluated on the test set consisting of 17 articles. Both methods obtain higher recall scores compared with the baseline, with a loss in precision. Besides higher recall, the word embeddings based approach achieves higher F-measure than the baseline and the tf.rf based methods. 
We also show that incorporating gene name and interaction keyword identification leads to improved precision and F-measure scores for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article. Database URL: https://github.com/ferhtaydn/biocemid/
format Online
Article
Text
id pubmed-5225401
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5225401 2017-01-18 Automatic query generation using word embeddings for retrieving passages describing experimental methods Aydın, Ferhat Hüsünbeyi, Zehra Melce Özgür, Arzucan Database (Oxford) Original Article Oxford University Press 2017-01-10 /pmc/articles/PMC5225401/ /pubmed/28077568 http://dx.doi.org/10.1093/database/baw166 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Automatic query generation using word embeddings for retrieving passages describing experimental methods
topic Original Article
work_keys_str_mv AT aydınferhat automaticquerygenerationusingwordembeddingsforretrievingpassagesdescribingexperimentalmethods
AT husunbeyizehramelce automaticquerygenerationusingwordembeddingsforretrievingpassagesdescribingexperimentalmethods
AT ozgurarzucan automaticquerygenerationusingwordembeddingsforretrievingpassagesdescribingexperimentalmethods
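
Illustrative note: the abstract describes two ways of expanding a baseline query built from experimental method names, one supervised (tf.rf term weighting) and one unsupervised (word embeddings). The Python sketch below is not the authors' implementation (their code is at the Database URL given in the abstract). It assumes the commonly cited form of relevance frequency, rf = log2(2 + a / max(1, c)), where a and c count the positive and negative documents containing the term, and it assumes that expansion terms are chosen by cosine similarity to the method name. The function names, the toy documents, and the three-dimensional vectors are all hypothetical and exist only to make the example runnable.

```python
import math


def tf_rf(term, positive_docs, negative_docs):
    """Score a term for one experimental method with tf.rf.

    Assumption: tf is the term's total count over the positive documents,
    and rf = log2(2 + a / max(1, c)), where a and c are the numbers of
    positive and negative documents that contain the term.
    Each document is a list of tokens.
    """
    tf = sum(doc.count(term) for doc in positive_docs)
    a = sum(1 for doc in positive_docs if term in doc)
    c = sum(1 for doc in negative_docs if term in doc)
    return tf * math.log2(2 + a / max(1, c))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def expand_query(seed_terms, embeddings, top_k=2):
    """Append the top_k nearest vocabulary words to each seed term."""
    expanded = list(seed_terms)
    for seed in seed_terms:
        if seed not in embeddings:
            continue  # no vector for this name, keep the baseline term only
        candidates = [w for w in embeddings if w not in expanded]
        candidates.sort(
            key=lambda w: cosine(embeddings[seed], embeddings[w]),
            reverse=True,
        )
        expanded.extend(candidates[:top_k])
    return expanded


# Hypothetical toy data: tokenised passages and hand-made 3-d vectors.
positive = [["bait", "prey", "bait"], ["bait", "assay"]]
negative = [["assay", "antibody"]]

toy_embeddings = {
    "coimmunoprecipitation": [0.9, 0.1, 0.0],
    "immunoprecipitated": [0.8, 0.2, 0.1],
    "antibody": [0.7, 0.3, 0.0],
    "hybrid": [0.0, 0.9, 0.2],
    "bait": [0.1, 0.8, 0.3],
    "weather": [0.0, 0.0, 1.0],
}

bait_score = tf_rf("bait", positive, negative)
query = expand_query(["coimmunoprecipitation"], toy_embeddings, top_k=2)
print(bait_score)
print(query)
```

In a realistic setting the vectors would come from a model trained on a large unlabeled full-text corpus, as the abstract states, and the expanded term list would then be matched against candidate passages in the same way as the baseline query.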