
A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

Objective To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators p...

Bibliographic Details
Main authors: Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects: Focus on Natural Language Processing
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4986661/
https://www.ncbi.nlm.nih.gov/pubmed/25948699
http://dx.doi.org/10.1093/jamia/ocv037
author Kors, Jan A
Clematide, Simon
Akhondi, Saber A
van Mulligen, Erik M
Rebholz-Schuhmann, Dietrich
collection PubMed
description Objective: To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. Discussion: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated.
format Online
Article
Text
id pubmed-4986661
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4986661 2016-09-01 J Am Med Inform Assoc Focus on Natural Language Processing
Oxford University Press 2015-09 2015-05-05 /pmc/articles/PMC4986661/ /pubmed/25948699 http://dx.doi.org/10.1093/jamia/ocv037 Text en © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC
topic Focus on Natural Language Processing
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4986661/
https://www.ncbi.nlm.nih.gov/pubmed/25948699
http://dx.doi.org/10.1093/jamia/ocv037