SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

BACKGROUND: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms....

Bibliographic Details
Main Authors: Oliveira, Lucas Emanuel Silva e; Peters, Ana Carolina; da Silva, Adalniza Moura Pucca; Gebeluca, Caroline Pilatti; Gumiel, Yohan Bonescki; Cintho, Lilian Mie Mukai; Carvalho, Deborah Ribeiro; Al Hasan, Sadid; Moro, Claudia Maria Cabral
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9080187/
https://www.ncbi.nlm.nih.gov/pubmed/35527259
http://dx.doi.org/10.1186/s13326-022-00269-1
_version_ 1784702727320764416
author Oliveira, Lucas Emanuel Silva e
Peters, Ana Carolina
da Silva, Adalniza Moura Pucca
Gebeluca, Caroline Pilatti
Gumiel, Yohan Bonescki
Cintho, Lilian Mie Mukai
Carvalho, Deborah Ribeiro
Al Hasan, Sadid
Moro, Claudia Maria Cabral
author_sort Oliveira, Lucas Emanuel Silva e
collection PubMed
description BACKGROUND: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. METHODS: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. RESULTS: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores. CONCLUSION: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.
format Online
Article
Text
id pubmed-9080187
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9080187 2022-05-09 J Biomed Semantics Research BioMed Central 2022-05-08 /pmc/articles/PMC9080187/ /pubmed/35527259 http://dx.doi.org/10.1186/s13326-022-00269-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9080187/
https://www.ncbi.nlm.nih.gov/pubmed/35527259
http://dx.doi.org/10.1186/s13326-022-00269-1