Annotation and initial evaluation of a large annotated German oncological corpus

OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is...

Bibliographic Details
Main Authors: Kittner, Madeleine, Lamping, Mario, Rieke, Damian T, Götze, Julian, Bajwa, Bariya, Jelas, Ivan, Rüter, Gina, Hautow, Hanjo, Sänger, Mario, Habibi, Maryam, Zettwitz, Marit, de Bortoli, Till, Ostermann, Leonie, Ševa, Jurica, Starlinger, Johannes, Kohlbacher, Oliver, Malek, Nisar P, Keilholz, Ulrich, Leser, Ulf
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Research and Applications
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8054032/
https://www.ncbi.nlm.nih.gov/pubmed/33898938
http://dx.doi.org/10.1093/jamiaopen/ooab025
author Kittner, Madeleine
Lamping, Mario
Rieke, Damian T
Götze, Julian
Bajwa, Bariya
Jelas, Ivan
Rüter, Gina
Hautow, Hanjo
Sänger, Mario
Habibi, Maryam
Zettwitz, Marit
de Bortoli, Till
Ostermann, Leonie
Ševa, Jurica
Starlinger, Johannes
Kohlbacher, Oliver
Malek, Nisar P
Keilholz, Ulrich
Leser, Ulf
author_sort Kittner, Madeleine
collection PubMed
description OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction from German medical texts. MATERIALS AND METHODS: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. RESULTS: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as held-out data set for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach depending on the specific entity types F1-scores of 0.72–0.90 for named entity recognition, 0.10–0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. DISCUSSION: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. CONCLUSION: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.
format Online
Article
Text
id pubmed-8054032
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8054032 2021-04-22 JAMIA Open Research and Applications Oxford University Press 2021-04-19 /pmc/articles/PMC8054032/ /pubmed/33898938 http://dx.doi.org/10.1093/jamiaopen/ooab025 Text en © The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Annotation and initial evaluation of a large annotated German oncological corpus
topic Research and Applications