Comparative analysis of five protein-protein interaction corpora

BACKGROUND: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently re...

Bibliographic Details
Main Authors:	Pyysalo, Sampo; Airola, Antti; Heimonen, Juho; Björne, Jari; Ginter, Filip; Salakoski, Tapio
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2349296/ https://www.ncbi.nlm.nih.gov/pubmed/18426551 http://dx.doi.org/10.1186/1471-2105-9-S3-S6

_version_	1782152850516738048
author	Pyysalo, Sampo; Airola, Antti; Heimonen, Juho; Björne, Jari; Ginter, Filip; Salakoski, Tapio
author_facet	Pyysalo, Sampo; Airola, Antti; Heimonen, Juho; Björne, Jari; Ginter, Filip; Salakoski, Tapio
author_sort	Pyysalo, Sampo
collection	PubMed
description	BACKGROUND: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate. RESULTS: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. CONCLUSIONS: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. 
In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at .
format	Text
id	pubmed-2349296
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-23492962008-04-29 Comparative analysis of five protein-protein interaction corpora Pyysalo, Sampo Airola, Antti Heimonen, Juho Björne, Jari Ginter, Filip Salakoski, Tapio BMC Bioinformatics Proceedings BACKGROUND: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate. RESULTS: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. 
CONCLUSIONS: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at . BioMed Central 2008-04-11 /pmc/articles/PMC2349296/ /pubmed/18426551 http://dx.doi.org/10.1186/1471-2105-9-S3-S6 Text en Copyright © 2008 Pyysalo et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Proceedings Pyysalo, Sampo Airola, Antti Heimonen, Juho Björne, Jari Ginter, Filip Salakoski, Tapio Comparative analysis of five protein-protein interaction corpora
title	Comparative analysis of five protein-protein interaction corpora
title_full	Comparative analysis of five protein-protein interaction corpora
title_fullStr	Comparative analysis of five protein-protein interaction corpora
title_full_unstemmed	Comparative analysis of five protein-protein interaction corpora
title_short	Comparative analysis of five protein-protein interaction corpora
title_sort	comparative analysis of five protein-protein interaction corpora
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2349296/ https://www.ncbi.nlm.nih.gov/pubmed/18426551 http://dx.doi.org/10.1186/1471-2105-9-S3-S6
work_keys_str_mv	AT pyysalosampo comparativeanalysisoffiveproteinproteininteractioncorpora AT airolaantti comparativeanalysisoffiveproteinproteininteractioncorpora AT heimonenjuho comparativeanalysisoffiveproteinproteininteractioncorpora AT bjornejari comparativeanalysisoffiveproteinproteininteractioncorpora AT ginterfilip comparativeanalysisoffiveproteinproteininteractioncorpora AT salakoskitapio comparativeanalysisoffiveproteinproteininteractioncorpora
