BC4GO: a full-text corpus for the BioCreative IV GO task

Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and suppor...

Bibliographic Details
Main authors: Van Auken, Kimberly, Schaeffer, Mary L., McQuilton, Peter, Laulederkind, Stanley J. F., Li, Donghui, Wang, Shur-Jen, Hayman, G. Thomas, Tweedie, Susan, Arighi, Cecilia N., Done, James, Müller, Hans-Michael, Sternberg, Paul W., Mao, Yuqing, Wei, Chih-Hsuan, Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112614/
https://www.ncbi.nlm.nih.gov/pubmed/25070993
http://dx.doi.org/10.1093/database/bau074
_version_ 1782328191835176960
author Van Auken, Kimberly
Schaeffer, Mary L.
McQuilton, Peter
Laulederkind, Stanley J. F.
Li, Donghui
Wang, Shur-Jen
Hayman, G. Thomas
Tweedie, Susan
Arighi, Cecilia N.
Done, James
Müller, Hans-Michael
Sternberg, Paul W.
Mao, Yuqing
Wei, Chih-Hsuan
Lu, Zhiyong
author_sort Van Auken, Kimberly
collection PubMed
description Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% distinct GO terms, while the Results/Experiment section has nearly 60% relevant sentences and >70% GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need of using full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community. Database URL: http://www.biocreative.org/resources/corpora/bc-iv-go-task-corpus/.
format Online
Article
Text
id pubmed-4112614
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4112614 2014-07-31 BC4GO: a full-text corpus for the BioCreative IV GO task Database (Oxford) Original Article Oxford University Press 2014-07-28 /pmc/articles/PMC4112614/ /pubmed/25070993 http://dx.doi.org/10.1093/database/bau074 Text en Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
title BC4GO: a full-text corpus for the BioCreative IV GO task
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112614/
https://www.ncbi.nlm.nih.gov/pubmed/25070993
http://dx.doi.org/10.1093/database/bau074