Automatic query generation using word embeddings for retrieving passages describing experimental methods
Information regarding the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central for many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. A lot of...
Main authors: | Aydın, Ferhat; Hüsünbeyi, Zehra Melce; Özgür, Arzucan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2017 |
Subjects: | Original Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225401/ https://www.ncbi.nlm.nih.gov/pubmed/28077568 http://dx.doi.org/10.1093/database/baw166 |
_version_ | 1782493498043269120 |
---|---|
author | Aydın, Ferhat Hüsünbeyi, Zehra Melce Özgür, Arzucan |
author_facet | Aydın, Ferhat Hüsünbeyi, Zehra Melce Özgür, Arzucan |
author_sort | Aydın, Ferhat |
collection | PubMed |
description | Information about physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central to many biological processes. The experimental techniques used to verify PPIs are vital for characterizing the identified PPIs and assessing their reliability. Much of the information about PPIs and the experimental methods is available only in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages that describe experimental methods for physical interactions between proteins as an information retrieval task. The baseline system is based on query matching, where the queries are generated from the names (including synonyms) of the experimental methods in the Proteomics Standards Initiative–Molecular Interactions (PSI-MI) ontology. We propose two methods in which the baseline queries are expanded with additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by applying the term frequency–relevance frequency (tf.rf) metric to 13 articles from our manually annotated data set of 30 full-text articles, which is made publicly available. The second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained from a large unlabeled full-text corpus. The proposed methods are evaluated on a test set of 17 articles. Both methods obtain higher recall than the baseline, with a loss in precision. In addition to higher recall, the word-embedding-based approach achieves a higher F-measure than both the baseline and the tf.rf-based method. We also show that incorporating gene name and interaction keyword identification improves precision and F-measure for all three evaluated methods. The tf.rf-based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word-embedding-based approach is a novel contribution of this article. Database URL: https://github.com/ferhtaydn/biocemid/ |
format | Online Article Text |
id | pubmed-5225401 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5225401 2017-01-18 Automatic query generation using word embeddings for retrieving passages describing experimental methods Aydın, Ferhat Hüsünbeyi, Zehra Melce Özgür, Arzucan Database (Oxford) Original Article Oxford University Press 2017-01-10 /pmc/articles/PMC5225401/ /pubmed/28077568 http://dx.doi.org/10.1093/database/baw166 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Aydın, Ferhat Hüsünbeyi, Zehra Melce Özgür, Arzucan Automatic query generation using word embeddings for retrieving passages describing experimental methods |
title | Automatic query generation using word embeddings for retrieving passages describing experimental methods |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225401/ https://www.ncbi.nlm.nih.gov/pubmed/28077568 http://dx.doi.org/10.1093/database/baw166 |
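
The two query-expansion ideas summarized in the description can be illustrated with a small sketch. This is not the authors' implementation (their code is at the Database URL given in the description); it is a minimal illustration under stated assumptions. The relevance frequency follows the commonly cited definition rf = log2(2 + a / max(1, c)), where a and c count the positive and negative passages containing a term; whether the article uses exactly this form is an assumption. The function names (`rf`, `tf_rf`, `cosine`, `expand_query`), the toy passages, and the three-dimensional toy vectors are all hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def rf(term, positive_docs, negative_docs):
    """Relevance frequency, assumed here as log2(2 + a / max(1, c))."""
    a = sum(1 for doc in positive_docs if term in doc)  # positive passages with term
    c = sum(1 for doc in negative_docs if term in doc)  # negative passages with term
    return math.log2(2 + a / max(1, c))


def tf_rf(positive_docs, negative_docs):
    """Score every term of the positive passages by tf * rf.

    tf is taken as the total count over the positive passages
    (an assumption made for this sketch).
    """
    tf = Counter(tok for doc in positive_docs for tok in doc)
    return {t: n * rf(t, positive_docs, negative_docs) for t, n in tf.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def expand_query(seed, vectors, k=2):
    """Return the seed term followed by its k nearest neighbours by cosine."""
    seed_vec = vectors[seed]
    scored = [(cosine(seed_vec, vec), w) for w, vec in vectors.items() if w != seed]
    scored.sort(reverse=True)
    return [seed] + [w for _, w in scored[:k]]


# Hypothetical tokenized passages: positive = annotated with the method,
# negative = not annotated with it.
positive = [["coimmunoprecipitation", "antibody", "lysate"],
            ["antibody", "immunoprecipitated", "lysate"]]
negative = [["antibody", "staining"], ["antibody", "cells"]]

# Hypothetical toy embeddings; real ones would be trained on a large corpus.
toy_vectors = {
    "coimmunoprecipitation": [0.9, 0.1, 0.0],
    "immunoprecipitated": [0.8, 0.2, 0.1],
    "pulldown": [0.7, 0.3, 0.0],
    "microscopy": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    ranked = sorted(tf_rf(positive, negative).items(), key=lambda kv: -kv[1])
    print("supervised expansion candidates:", ranked[:3])
    print("unsupervised expansion:", expand_query("coimmunoprecipitation", toy_vectors))
```

In the supervised variant, terms that appear often in annotated passages but rarely elsewhere receive the highest scores; in the unsupervised variant, no annotation is needed, only vectors for the ontology names.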