BC4GO: a full-text corpus for the BioCreative IV GO task
Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and suppor...
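Main Authors: | Van Auken, Kimberly; Schaeffer, Mary L.; McQuilton, Peter; Laulederkind, Stanley J. F.; Li, Donghui; Wang, Shur-Jen; Hayman, G. Thomas; Tweedie, Susan; Arighi, Cecilia N.; Done, James; Müller, Hans-Michael; Sternberg, Paul W.; Mao, Yuqing; Wei, Chih-Hsuan; Lu, Zhiyong |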
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2014 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112614/ https://www.ncbi.nlm.nih.gov/pubmed/25070993 http://dx.doi.org/10.1093/database/bau074 |
_version_ | 1782328191835176960 |
---|---|
author | Van Auken, Kimberly; Schaeffer, Mary L.; McQuilton, Peter; Laulederkind, Stanley J. F.; Li, Donghui; Wang, Shur-Jen; Hayman, G. Thomas; Tweedie, Susan; Arighi, Cecilia N.; Done, James; Müller, Hans-Michael; Sternberg, Paul W.; Mao, Yuqing; Wei, Chih-Hsuan; Lu, Zhiyong |
author_facet | Van Auken, Kimberly; Schaeffer, Mary L.; McQuilton, Peter; Laulederkind, Stanley J. F.; Li, Donghui; Wang, Shur-Jen; Hayman, G. Thomas; Tweedie, Susan; Arighi, Cecilia N.; Done, James; Müller, Hans-Michael; Sternberg, Paul W.; Mao, Yuqing; Wei, Chih-Hsuan; Lu, Zhiyong |
author_sort | Van Auken, Kimberly |
collection | PubMed |
description | Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% distinct GO terms, while the Results/Experiment section has nearly 60% relevant sentences and >70% GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need of using full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community. Database URL: http://www.biocreative.org/resources/corpora/bc-iv-go-task-corpus/. |
format | Online Article Text |
id | pubmed-4112614 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-41126142014-07-31 BC4GO: a full-text corpus for the BioCreative IV GO task Van Auken, Kimberly Schaeffer, Mary L. McQuilton, Peter Laulederkind, Stanley J. F. Li, Donghui Wang, Shur-Jen Hayman, G. Thomas Tweedie, Susan Arighi, Cecilia N. Done, James Müller, Hans-Michael Sternberg, Paul W. Mao, Yuqing Wei, Chih-Hsuan Lu, Zhiyong Database (Oxford) Original Article Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F(1)-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% distinct GO terms, while the Results/Experiment section has nearly 60% relevant sentences and >70% GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. 
This result demonstrates the need of using full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community. Database URL: http://www.biocreative.org/resources/corpora/bc-iv-go-task-corpus/. Oxford University Press 2014-07-28 /pmc/articles/PMC4112614/ /pubmed/25070993 http://dx.doi.org/10.1093/database/bau074 Text en Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US. |
spellingShingle | Original Article Van Auken, Kimberly Schaeffer, Mary L. McQuilton, Peter Laulederkind, Stanley J. F. Li, Donghui Wang, Shur-Jen Hayman, G. Thomas Tweedie, Susan Arighi, Cecilia N. Done, James Müller, Hans-Michael Sternberg, Paul W. Mao, Yuqing Wei, Chih-Hsuan Lu, Zhiyong BC4GO: a full-text corpus for the BioCreative IV GO task |
title | BC4GO: a full-text corpus for the BioCreative IV GO task |
title_full | BC4GO: a full-text corpus for the BioCreative IV GO task |
title_fullStr | BC4GO: a full-text corpus for the BioCreative IV GO task |
title_full_unstemmed | BC4GO: a full-text corpus for the BioCreative IV GO task |
title_short | BC4GO: a full-text corpus for the BioCreative IV GO task |
title_sort | bc4go: a full-text corpus for the biocreative iv go task |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112614/ https://www.ncbi.nlm.nih.gov/pubmed/25070993 http://dx.doi.org/10.1093/database/bau074 |
work_keys_str_mv | AT vanaukenkimberly bc4goafulltextcorpusforthebiocreativeivgotask AT schaeffermaryl bc4goafulltextcorpusforthebiocreativeivgotask AT mcquiltonpeter bc4goafulltextcorpusforthebiocreativeivgotask AT laulederkindstanleyjf bc4goafulltextcorpusforthebiocreativeivgotask AT lidonghui bc4goafulltextcorpusforthebiocreativeivgotask AT wangshurjen bc4goafulltextcorpusforthebiocreativeivgotask AT haymangthomas bc4goafulltextcorpusforthebiocreativeivgotask AT tweediesusan bc4goafulltextcorpusforthebiocreativeivgotask AT arighicecilian bc4goafulltextcorpusforthebiocreativeivgotask AT donejames bc4goafulltextcorpusforthebiocreativeivgotask AT mullerhansmichael bc4goafulltextcorpusforthebiocreativeivgotask AT sternbergpaulw bc4goafulltextcorpusforthebiocreativeivgotask AT maoyuqing bc4goafulltextcorpusforthebiocreativeivgotask AT weichihhsuan bc4goafulltextcorpusforthebiocreativeivgotask AT luzhiyong bc4goafulltextcorpusforthebiocreativeivgotask |