
Investigating heterogeneous protein annotations toward cross-corpora utilization

BACKGROUND: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However,...


Bibliographic Details
Main Authors: Wang, Yue; Kim, Jin-Dong; Sætre, Rune; Pyysalo, Sampo; Tsujii, Jun'ichi
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2804683/
https://www.ncbi.nlm.nih.gov/pubmed/19995463
http://dx.doi.org/10.1186/1471-2105-10-403
author Wang, Yue
Kim, Jin-Dong
Sætre, Rune
Pyysalo, Sampo
Tsujii, Jun'ichi
collection PubMed
description BACKGROUND: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. RESULTS: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments.
Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aforementioned aspects. CONCLUSION: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.
format Text
id pubmed-2804683
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2804683 2010-01-12 Investigating heterogeneous protein annotations toward cross-corpora utilization Wang, Yue Kim, Jin-Dong Sætre, Rune Pyysalo, Sampo Tsujii, Jun'ichi BMC Bioinformatics Research article BioMed Central 2009-12-09 /pmc/articles/PMC2804683/ /pubmed/19995463 http://dx.doi.org/10.1186/1471-2105-10-403 Text en Copyright ©2009 Wang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Investigating heterogeneous protein annotations toward cross-corpora utilization
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2804683/
https://www.ncbi.nlm.nih.gov/pubmed/19995463
http://dx.doi.org/10.1186/1471-2105-10-403