Protein Name Tagging Guidelines: Lessons Learned

Interest in information extraction from the biomedical literature is motivated by the need to speed up the creation of structured databases representing the latest scientific knowledge about specific objects, such as proteins and genes. This paper addresses the issue of a lack of standard definition...

Bibliographic Details
Main Authors: Mani, Inderjeet, Hu, Zhangzhi, Jang, Seok Bae, Samuel, Ken, Krause, Matthew, Phillips, Jon, Wu, Cathy H.
Format: Text
Language: English
Published: Hindawi Publishing Corporation 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2448601/
https://www.ncbi.nlm.nih.gov/pubmed/18629297
http://dx.doi.org/10.1002/cfg.452
_version_ 1782157179072020480
author Mani, Inderjeet
Hu, Zhangzhi
Jang, Seok Bae
Samuel, Ken
Krause, Matthew
Phillips, Jon
Wu, Cathy H.
author_facet Mani, Inderjeet
Hu, Zhangzhi
Jang, Seok Bae
Samuel, Ken
Krause, Matthew
Phillips, Jon
Wu, Cathy H.
author_sort Mani, Inderjeet
collection PubMed
description Interest in information extraction from the biomedical literature is motivated by the need to speed up the creation of structured databases representing the latest scientific knowledge about specific objects, such as proteins and genes. This paper addresses the issue of a lack of standard definition of the problem of protein name tagging. We describe the lessons learned in developing a set of guidelines and present the first set of inter-coder results, viewed as an upper bound on system performance. Problems coders face include: (a) the ambiguity of names that can refer to either genes or proteins; (b) the difficulty of getting the exact extents of long protein names; and (c) the complexity of the guidelines. These problems have been addressed in two ways: (a) defining the tagging targets as protein named entities used in the literature to describe proteins or protein-associated or -related objects, such as domains, pathways, expression or genes, and (b) using two types of tags, protein tags and long-form tags, with the latter being used to optionally extend the boundaries of the protein tag when the name boundary is difficult to determine. Inter-coder consistency across three annotators on protein tags on 300 MEDLINE abstracts is 0.868 F-measure. The guidelines and annotated datasets, along with automatic tools, are available for research use.
format Text
id pubmed-2448601
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-2448601 2008-07-14 Protein Name Tagging Guidelines: Lessons Learned Mani, Inderjeet Hu, Zhangzhi Jang, Seok Bae Samuel, Ken Krause, Matthew Phillips, Jon Wu, Cathy H. Comp Funct Genomics Research Article Interest in information extraction from the biomedical literature is motivated by the need to speed up the creation of structured databases representing the latest scientific knowledge about specific objects, such as proteins and genes. This paper addresses the issue of a lack of standard definition of the problem of protein name tagging. We describe the lessons learned in developing a set of guidelines and present the first set of inter-coder results, viewed as an upper bound on system performance. Problems coders face include: (a) the ambiguity of names that can refer to either genes or proteins; (b) the difficulty of getting the exact extents of long protein names; and (c) the complexity of the guidelines. These problems have been addressed in two ways: (a) defining the tagging targets as protein named entities used in the literature to describe proteins or protein-associated or -related objects, such as domains, pathways, expression or genes, and (b) using two types of tags, protein tags and long-form tags, with the latter being used to optionally extend the boundaries of the protein tag when the name boundary is difficult to determine. Inter-coder consistency across three annotators on protein tags on 300 MEDLINE abstracts is 0.868 F-measure. The guidelines and annotated datasets, along with automatic tools, are available for research use. Hindawi Publishing Corporation 2005 /pmc/articles/PMC2448601/ /pubmed/18629297 http://dx.doi.org/10.1002/cfg.452 Text en Copyright © 2005 Hindawi Publishing Corporation.
http://creativecommons.org/licenses/by/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Mani, Inderjeet
Hu, Zhangzhi
Jang, Seok Bae
Samuel, Ken
Krause, Matthew
Phillips, Jon
Wu, Cathy H.
Protein Name Tagging Guidelines: Lessons Learned
title Protein Name Tagging Guidelines: Lessons Learned
title_full Protein Name Tagging Guidelines: Lessons Learned
title_fullStr Protein Name Tagging Guidelines: Lessons Learned
title_full_unstemmed Protein Name Tagging Guidelines: Lessons Learned
title_short Protein Name Tagging Guidelines: Lessons Learned
title_sort protein name tagging guidelines: lessons learned
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2448601/
https://www.ncbi.nlm.nih.gov/pubmed/18629297
http://dx.doi.org/10.1002/cfg.452
work_keys_str_mv AT maniinderjeet proteinnametaggingguidelineslessonslearned
AT huzhangzhi proteinnametaggingguidelineslessonslearned
AT jangseokbae proteinnametaggingguidelineslessonslearned
AT samuelken proteinnametaggingguidelineslessonslearned
AT krausematthew proteinnametaggingguidelineslessonslearned
AT phillipsjon proteinnametaggingguidelineslessonslearned
AT wucathyh proteinnametaggingguidelineslessonslearned