
Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

Bibliographic Details
Main Authors: Rebholz-Schuhmann, Dietrich, Yepes, Antonio Jimeno, Li, Chen, Kafkas, Senay, Lewin, Ian, Kang, Ning, Corbett, Peter, Milward, David, Buyko, Ekaterina, Beisswanger, Elena, Hornbostel, Kerstin, Kouznetsov, Alexandre, Witte, René, Laurila, Jonas B, Baker, Christopher JO, Kuo, Cheng-Ju, Clematide, Simone, Rinaldi, Fabio, Farkas, Richárd, Móra, György, Hara, Kazuo, Furlong, Laura I, Rautschka, Michael, Neves, Mariana Lara, Pascual-Montano, Alberto, Wei, Qi, Collier, Nigel, Chowdhury, Md Faisal Mahbub, Lavelli, Alberto, Berlanga, Rafael, Morante, Roser, Van Asch, Vincent, Daelemans, Walter, Marina, José Luís, van Mulligen, Erik, Kors, Jan, Hahn, Udo
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3239301/
https://www.ncbi.nlm.nih.gov/pubmed/22166494
http://dx.doi.org/10.1186/2041-1480-2-S5-S11
author Rebholz-Schuhmann, Dietrich
Yepes, Antonio Jimeno
Li, Chen
Kafkas, Senay
Lewin, Ian
Kang, Ning
Corbett, Peter
Milward, David
Buyko, Ekaterina
Beisswanger, Elena
Hornbostel, Kerstin
Kouznetsov, Alexandre
Witte, René
Laurila, Jonas B
Baker, Christopher JO
Kuo, Cheng-Ju
Clematide, Simone
Rinaldi, Fabio
Farkas, Richárd
Móra, György
Hara, Kazuo
Furlong, Laura I
Rautschka, Michael
Neves, Mariana Lara
Pascual-Montano, Alberto
Wei, Qi
Collier, Nigel
Chowdhury, Md Faisal Mahbub
Lavelli, Alberto
Berlanga, Rafael
Morante, Roser
Van Asch, Vincent
Daelemans, Walter
Marina, José Luís
van Mulligen, Erik
Kors, Jan
Hahn, Udo
collection PubMed
description BACKGROUND: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of a GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus covering four different semantic groups through the harmonisation of annotations from automatic text mining solutions: the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO), and species (SPE). This corpus was used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions. RESULTS: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could either ignore the training data and deliver the annotations from their own annotation system, or train a machine-learning approach on the provided pre-annotated data. In general, the performance of the annotation solutions was lower for entities from the categories CHED and PRGE than for entities categorised as DISO and SPE. The best performance across all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants who did not make use of the SSC-I annotations for training were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performance of the participants’ solutions was then measured against the SSC-II and again showed better results for DISO and SPE than for CHED and PRGE. CONCLUSIONS: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II yields better performance for the CPs’ annotation solutions than benchmarking against the SSC-I.
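Illustration (not taken from the article): the evaluation described above relies on entity-level precision, recall and F-measure against a harmonised silver standard. The short Python sketch below shows one possible way to build such a silver standard by majority voting over the annotations of several systems and to score a single system against it. All names, the voting threshold and the strict exact-boundary matching are assumptions made for illustration only and do not reproduce the actual CALBC harmonisation or matching rules.

# Hypothetical sketch (not the CALBC implementation): harmonise entity
# annotations from several systems into a "silver standard" by majority
# vote, then score one system against it with entity-level precision,
# recall and F-measure using strict exact-boundary matching.
from collections import Counter
from typing import List, Set, Tuple

# An annotation is (doc_id, start_offset, end_offset, semantic_group),
# e.g. ("MEDLINE:123", 10, 18, "DISO").
Annotation = Tuple[str, int, int, str]

def harmonise(system_outputs: List[Set[Annotation]], min_votes: int) -> Set[Annotation]:
    """Keep every annotation proposed by at least `min_votes` systems."""
    votes = Counter(a for output in system_outputs for a in output)
    return {a for a, n in votes.items() if n >= min_votes}

def precision_recall_f1(predicted: Set[Annotation], silver: Set[Annotation]) -> Tuple[float, float, float]:
    """Exact-match evaluation of one system against the silver standard."""
    tp = len(predicted & silver)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(silver) if silver else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    sys_a = {("doc1", 0, 7, "PRGE"), ("doc1", 20, 27, "DISO")}
    sys_b = {("doc1", 0, 7, "PRGE"), ("doc1", 40, 45, "SPE")}
    sys_c = {("doc1", 0, 7, "PRGE"), ("doc1", 20, 27, "DISO")}
    silver = harmonise([sys_a, sys_b, sys_c], min_votes=2)
    p, r, f = precision_recall_f1(sys_b, silver)
    print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")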
format Online
Article
Text
id pubmed-3239301
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3239301 2011-12-16 Assessment of NER solutions against the first and second CALBC Silver Standard Corpus J Biomed Semantics Research BioMed Central 2011-10-06 /pmc/articles/PMC3239301/ /pubmed/22166494 http://dx.doi.org/10.1186/2041-1480-2-S5-S11 Text en Copyright ©2011 Rebholz-Schuhmann et al; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Assessment of NER solutions against the first and second CALBC Silver Standard Corpus
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3239301/
https://www.ncbi.nlm.nih.gov/pubmed/22166494
http://dx.doi.org/10.1186/2041-1480-2-S5-S11