Annotation and initial evaluation of a large annotated German oncological corpus
OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is...
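Main authors: | Kittner, Madeleine; Lamping, Mario; Rieke, Damian T; Götze, Julian; Bajwa, Bariya; Jelas, Ivan; Rüter, Gina; Hautow, Hanjo; Sänger, Mario; Habibi, Maryam; Zettwitz, Marit; de Bortoli, Till; Ostermann, Leonie; Ševa, Jurica; Starlinger, Johannes; Kohlbacher, Oliver; Malek, Nisar P; Keilholz, Ulrich; Leser, Ulf |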
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2021 |
Subjects: | Research and Applications |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054032/ https://www.ncbi.nlm.nih.gov/pubmed/33898938 http://dx.doi.org/10.1093/jamiaopen/ooab025 |
_version_ | 1783680230025592832 |
---|---|
author | Kittner, Madeleine Lamping, Mario Rieke, Damian T Götze, Julian Bajwa, Bariya Jelas, Ivan Rüter, Gina Hautow, Hanjo Sänger, Mario Habibi, Maryam Zettwitz, Marit de Bortoli, Till Ostermann, Leonie Ševa, Jurica Starlinger, Johannes Kohlbacher, Oliver Malek, Nisar P Keilholz, Ulrich Leser, Ulf |
author_facet | Kittner, Madeleine Lamping, Mario Rieke, Damian T Götze, Julian Bajwa, Bariya Jelas, Ivan Rüter, Gina Hautow, Hanjo Sänger, Mario Habibi, Maryam Zettwitz, Marit de Bortoli, Till Ostermann, Leonie Ševa, Jurica Starlinger, Johannes Kohlbacher, Oliver Malek, Nisar P Keilholz, Ulrich Leser, Ulf |
author_sort | Kittner, Madeleine |
collection | PubMed |
description | OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction from German medical texts. MATERIALS AND METHODS: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. RESULTS: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as held-out data set for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach depending on the specific entity types F1-scores of 0.72–0.90 for named entity recognition, 0.10–0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. DISCUSSION: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. CONCLUSION: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English. |
format | Online Article Text |
id | pubmed-8054032 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-80540322021-04-22 Annotation and initial evaluation of a large annotated German oncological corpus Kittner, Madeleine Lamping, Mario Rieke, Damian T Götze, Julian Bajwa, Bariya Jelas, Ivan Rüter, Gina Hautow, Hanjo Sänger, Mario Habibi, Maryam Zettwitz, Marit de Bortoli, Till Ostermann, Leonie Ševa, Jurica Starlinger, Johannes Kohlbacher, Oliver Malek, Nisar P Keilholz, Ulrich Leser, Ulf JAMIA Open Research and Applications OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction from German medical texts. MATERIALS AND METHODS: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. RESULTS: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as held-out data set for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach depending on the specific entity types F1-scores of 0.72–0.90 for named entity recognition, 0.10–0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. DISCUSSION: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. 
CONCLUSION: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English. Oxford University Press 2021-04-19 /pmc/articles/PMC8054032/ /pubmed/33898938 http://dx.doi.org/10.1093/jamiaopen/ooab025 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Research and Applications Kittner, Madeleine Lamping, Mario Rieke, Damian T Götze, Julian Bajwa, Bariya Jelas, Ivan Rüter, Gina Hautow, Hanjo Sänger, Mario Habibi, Maryam Zettwitz, Marit de Bortoli, Till Ostermann, Leonie Ševa, Jurica Starlinger, Johannes Kohlbacher, Oliver Malek, Nisar P Keilholz, Ulrich Leser, Ulf Annotation and initial evaluation of a large annotated German oncological corpus |
title | Annotation and initial evaluation of a large annotated German oncological corpus |
title_full | Annotation and initial evaluation of a large annotated German oncological corpus |
title_fullStr | Annotation and initial evaluation of a large annotated German oncological corpus |
title_full_unstemmed | Annotation and initial evaluation of a large annotated German oncological corpus |
title_short | Annotation and initial evaluation of a large annotated German oncological corpus |
title_sort | annotation and initial evaluation of a large annotated german oncological corpus |
topic | Research and Applications |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054032/ https://www.ncbi.nlm.nih.gov/pubmed/33898938 http://dx.doi.org/10.1093/jamiaopen/ooab025 |