The gene normalization task in BioCreative III
BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles w...
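Main authors: | Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan; Huang, Minlie; Liu, Jingchen; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Dai, Hong-Jie; Okazaki, Naoaki; Cho, Han-Cheol; Gerner, Martin; Solt, Illes; Agarwal, Shashank; Liu, Feifan; Vishnyakova, Dina; Ruch, Patrick; Romacker, Martin; Rinaldi, Fabio; Bhattacharya, Sanmitra; Srinivasan, Padmini; Liu, Hongfang; Torii, Manabu; Matos, Sergio; Campos, David; Verspoor, Karin; Livingston, Kevin M; Wilbur, W John |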
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2011 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269937/ https://www.ncbi.nlm.nih.gov/pubmed/22151901 http://dx.doi.org/10.1186/1471-2105-12-S8-S2 |
_version_ | 1782222523057831936 |
---|---|
author | Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan; Huang, Minlie; Liu, Jingchen; Kuo, Cheng-Ju; Hsu, Chun-Nan; Tsai, Richard Tzong-Han; Dai, Hong-Jie; Okazaki, Naoaki; Cho, Han-Cheol; Gerner, Martin; Solt, Illes; Agarwal, Shashank; Liu, Feifan; Vishnyakova, Dina; Ruch, Patrick; Romacker, Martin; Rinaldi, Fabio; Bhattacharya, Sanmitra; Srinivasan, Padmini; Liu, Hongfang; Torii, Manabu; Matos, Sergio; Campos, David; Verspoor, Karin; Livingston, Kevin M; Wilbur, W John |
author_sort | Lu, Zhiyong |
collection | PubMed |
description | BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance. |
format | Online Article Text |
id | pubmed-3269937 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3269937 2012-02-02 The gene normalization task in BioCreative III BMC Bioinformatics Research BioMed Central 2011-10-03 /pmc/articles/PMC3269937/ /pubmed/22151901 http://dx.doi.org/10.1186/1471-2105-12-S8-S2 Text en Copyright ©2011 Lu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | The gene normalization task in BioCreative III |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269937/ https://www.ncbi.nlm.nih.gov/pubmed/22151901 http://dx.doi.org/10.1186/1471-2105-12-S8-S2 |
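
The description reports that the best composite system improved on the best team results by 12.4%, 21.8%, and 26.6% on the gold standard. The short sketch below recomputes those percentages from the TAP-k scores quoted in the description, assuming the improvements are relative gains over the best single-team score at each k. The names `best_team`, `best_composite`, and `relative_improvement` are illustrative and do not come from the record.

```python
# TAP-k scores on the 50-article gold standard, copied from the description.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
best_composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage."""
    return (improved - baseline) / baseline * 100.0


# Assumption: the reported improvements are relative gains, rounded to one decimal.
improvements = {
    k: round(relative_improvement(best_team[k], best_composite[k]), 1)
    for k in best_team
}

for k, pct in improvements.items():
    print(f"k={k}: {pct}%")
```

Under that assumption the recomputed values match the figures in the description, which suggests the percentages are relative rather than absolute differences in TAP-k.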