
The gene normalization task in BioCreative III

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles w...


Bibliographic Details
Main authors: Lu, Zhiyong, Kao, Hung-Yu, Wei, Chih-Hsuan, Huang, Minlie, Liu, Jingchen, Kuo, Cheng-Ju, Hsu, Chun-Nan, Tsai, Richard Tzong-Han, Dai, Hong-Jie, Okazaki, Naoaki, Cho, Han-Cheol, Gerner, Martin, Solt, Illes, Agarwal, Shashank, Liu, Feifan, Vishnyakova, Dina, Ruch, Patrick, Romacker, Martin, Rinaldi, Fabio, Bhattacharya, Sanmitra, Srinivasan, Padmini, Liu, Hongfang, Torii, Manabu, Matos, Sergio, Campos, David, Verspoor, Karin, Livingston, Kevin M, Wilbur, W John
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269937/
https://www.ncbi.nlm.nih.gov/pubmed/22151901
http://dx.doi.org/10.1186/1471-2105-12-S8-S2
_version_ 1782222523057831936
author Lu, Zhiyong
Kao, Hung-Yu
Wei, Chih-Hsuan
Huang, Minlie
Liu, Jingchen
Kuo, Cheng-Ju
Hsu, Chun-Nan
Tsai, Richard Tzong-Han
Dai, Hong-Jie
Okazaki, Naoaki
Cho, Han-Cheol
Gerner, Martin
Solt, Illes
Agarwal, Shashank
Liu, Feifan
Vishnyakova, Dina
Ruch, Patrick
Romacker, Martin
Rinaldi, Fabio
Bhattacharya, Sanmitra
Srinivasan, Padmini
Liu, Hongfang
Torii, Manabu
Matos, Sergio
Campos, David
Verspoor, Karin
Livingston, Kevin M
Wilbur, W John
author_sort Lu, Zhiyong
collection PubMed
description BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
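The improvement percentages quoted in the description can be checked with simple arithmetic. The sketch below assumes they are relative gains of the composite system over the best single-team score at each k, rounded to one decimal place; the variable and function names are illustrative and do not come from the record.

```python
# TAP-k scores on the 50-article gold standard, as quoted in the description.
# Keys are the k values (5, 10, 20).
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline, improved):
    """Return the percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Assumption: the reported figures are relative gains rounded to one decimal.
gains = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}

for k, gain in gains.items():
    print(f"k={k}: {gain:.1f}%")
```

Under that assumption the computed gains agree with the 12.4%, 21.8%, and 26.6% stated in the description.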
format Online
Article
Text
id pubmed-3269937
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-32699372012-02-02 The gene normalization task in BioCreative III Lu, Zhiyong Kao, Hung-Yu Wei, Chih-Hsuan Huang, Minlie Liu, Jingchen Kuo, Cheng-Ju Hsu, Chun-Nan Tsai, Richard Tzong-Han Dai, Hong-Jie Okazaki, Naoaki Cho, Han-Cheol Gerner, Martin Solt, Illes Agarwal, Shashank Liu, Feifan Vishnyakova, Dina Ruch, Patrick Romacker, Martin Rinaldi, Fabio Bhattacharya, Sanmitra Srinivasan, Padmini Liu, Hongfang Torii, Manabu Matos, Sergio Campos, David Verspoor, Karin Livingston, Kevin M Wilbur, W John BMC Bioinformatics Research BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. 
When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance. BioMed Central 2011-10-03 /pmc/articles/PMC3269937/ /pubmed/22151901 http://dx.doi.org/10.1186/1471-2105-12-S8-S2 Text en Copyright ©2011 Lu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title The gene normalization task in BioCreative III
title_sort gene normalization task in biocreative iii
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269937/
https://www.ncbi.nlm.nih.gov/pubmed/22151901
http://dx.doi.org/10.1186/1471-2105-12-S8-S2