
Data preparation and interannotator agreement: BioCreAtIvE Task 1B


Bibliographic Details
Main Authors: Colosimo, Marc E, Morgan, Alexander A, Yeh, Alexander S, Colombe, Jeffrey B, Hirschman, Lynette
Format: Text
Language: English
Published: BioMed Central 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869005/
https://www.ncbi.nlm.nih.gov/pubmed/15960824
http://dx.doi.org/10.1186/1471-2105-6-S1-S12
_version_ 1782133425166090240
author Colosimo, Marc E
Morgan, Alexander A
Yeh, Alexander S
Colombe, Jeffrey B
Hirschman, Lynette
author_facet Colosimo, Marc E
Morgan, Alexander A
Yeh, Alexander S
Colombe, Jeffrey B
Hirschman, Lynette
author_sort Colosimo, Marc E
collection PubMed
description BACKGROUND: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset. RESULTS: Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement), but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add an additional check on the test data; this allowed us to find additional errors, especially in Mouse. This led to a 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the Mouse answer key. CONCLUSION: We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying genes in a paper, containing only a fraction of the genes mentioned in the full text (25% for Fly, 36% for Mouse). We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster and allowed us to identify more conflicting genes than interannotator analysis.
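The description above mentions two computable steps: automatically pruning the full-article gene list down to identifiers whose names actually appear in the abstract, and measuring agreement between annotators' gene lists. The following is a minimal Python sketch of both ideas; the gene identifiers, synonym table, and abstract text are hypothetical stand-ins, not the actual BioCreAtIvE Task 1B data or tooling, and the matching is deliberately simplistic (case-insensitive substring search).

# Illustrative sketch only: identifiers, synonyms, and abstract are made up.

def prune_gene_list(full_text_genes, synonyms, abstract):
    """Keep only gene identifiers with at least one synonym appearing in the abstract."""
    text = abstract.lower()
    return {g for g in full_text_genes
            if any(s.lower() in text for s in synonyms.get(g, []))}

def pairwise_agreement(list_a, list_b):
    """Fraction of genes both annotators agree on, out of all genes either proposed."""
    union = list_a | list_b
    return len(list_a & list_b) / len(union) if union else 1.0

synonyms = {"FBgn0000001": ["white", "w"], "FBgn0000002": ["rosy", "ry"]}
abstract = "Mutations in the white gene alter eye pigmentation in Drosophila."
print(prune_gene_list({"FBgn0000001", "FBgn0000002"}, synonyms, abstract))  # {'FBgn0000001'}
print(pairwise_agreement({"FBgn0000001"}, {"FBgn0000001", "FBgn0000002"}))  # 0.5

In practice, the paper's pruning and agreement analysis operated on curated model-organism database gene lists and used human annotators; this snippet only illustrates the shape of the computation.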
format Text
id pubmed-1869005
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1869005 2007-05-18 Data preparation and interannotator agreement: BioCreAtIvE Task 1B Colosimo, Marc E Morgan, Alexander A Yeh, Alexander S Colombe, Jeffrey B Hirschman, Lynette BMC Bioinformatics Report BACKGROUND: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset. RESULTS: Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement), but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add an additional check on the test data; this allowed us to find additional errors, especially in Mouse. This led to a 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the Mouse answer key. CONCLUSION: We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying genes in a paper, containing only a fraction of the genes mentioned in the full text (25% for Fly, 36% for Mouse). We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster and allowed us to identify more conflicting genes than interannotator analysis. BioMed Central 2005-05-24 /pmc/articles/PMC1869005/ /pubmed/15960824 http://dx.doi.org/10.1186/1471-2105-6-S1-S12 Text en Copyright © 2005 Colosimo et al; licensee BioMed Central Ltd http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Report
Colosimo, Marc E
Morgan, Alexander A
Yeh, Alexander S
Colombe, Jeffrey B
Hirschman, Lynette
Data preparation and interannotator agreement: BioCreAtIvE Task 1B
title Data preparation and interannotator agreement: BioCreAtIvE Task 1B
title_full Data preparation and interannotator agreement: BioCreAtIvE Task 1B
title_fullStr Data preparation and interannotator agreement: BioCreAtIvE Task 1B
title_full_unstemmed Data preparation and interannotator agreement: BioCreAtIvE Task 1B
title_short Data preparation and interannotator agreement: BioCreAtIvE Task 1B
title_sort data preparation and interannotator agreement: biocreative task 1b
topic Report
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869005/
https://www.ncbi.nlm.nih.gov/pubmed/15960824
http://dx.doi.org/10.1186/1471-2105-6-S1-S12
work_keys_str_mv AT colosimomarce datapreparationandinterannotatoragreementbiocreativetask1b
AT morganalexandera datapreparationandinterannotatoragreementbiocreativetask1b
AT yehalexanders datapreparationandinterannotatoragreementbiocreativetask1b
AT colombejeffreyb datapreparationandinterannotatoragreementbiocreativetask1b
AT hirschmanlynette datapreparationandinterannotatoragreementbiocreativetask1b