SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks
BACKGROUND: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms....
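Main authors: | Oliveira, Lucas Emanuel Silva e; Peters, Ana Carolina; da Silva, Adalniza Moura Pucca; Gebeluca, Caroline Pilatti; Gumiel, Yohan Bonescki; Cintho, Lilian Mie Mukai; Carvalho, Deborah Ribeiro; Al Hasan, Sadid; Moro, Claudia Maria Cabral |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9080187/ https://www.ncbi.nlm.nih.gov/pubmed/35527259 http://dx.doi.org/10.1186/s13326-022-00269-1 |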
_version_ | 1784702727320764416 |
---|---|
author | Oliveira, Lucas Emanuel Silva e; Peters, Ana Carolina; da Silva, Adalniza Moura Pucca; Gebeluca, Caroline Pilatti; Gumiel, Yohan Bonescki; Cintho, Lilian Mie Mukai; Carvalho, Deborah Ribeiro; Al Hasan, Sadid; Moro, Claudia Maria Cabral |
author_facet | Oliveira, Lucas Emanuel Silva e; Peters, Ana Carolina; da Silva, Adalniza Moura Pucca; Gebeluca, Caroline Pilatti; Gumiel, Yohan Bonescki; Cintho, Lilian Mie Mukai; Carvalho, Deborah Ribeiro; Al Hasan, Sadid; Moro, Claudia Maria Cabral |
author_sort | Oliveira, Lucas Emanuel Silva e |
collection | PubMed |
description | BACKGROUND: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. METHODS: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. RESULTS: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores. CONCLUSION: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus. |
format | Online Article Text |
id | pubmed-9080187 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-90801872022-05-09 SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks Oliveira, Lucas Emanuel Silva e Peters, Ana Carolina da Silva, Adalniza Moura Pucca Gebeluca, Caroline Pilatti Gumiel, Yohan Bonescki Cintho, Lilian Mie Mukai Carvalho, Deborah Ribeiro Al Hasan, Sadid Moro, Claudia Maria Cabral J Biomed Semantics Research BACKGROUND: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. METHODS: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. RESULTS: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. 
The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores. CONCLUSION: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus. BioMed Central 2022-05-08 /pmc/articles/PMC9080187/ /pubmed/35527259 http://dx.doi.org/10.1186/s13326-022-00269-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Oliveira, Lucas Emanuel Silva e Peters, Ana Carolina da Silva, Adalniza Moura Pucca Gebeluca, Caroline Pilatti Gumiel, Yohan Bonescki Cintho, Lilian Mie Mukai Carvalho, Deborah Ribeiro Al Hasan, Sadid Moro, Claudia Maria Cabral SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks |
title | SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks |
title_full | SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks |
title_fullStr | SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks |
title_full_unstemmed | SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks |
title_short | SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks |
title_sort | semclinbr - a multi-institutional and multi-specialty semantically annotated corpus for portuguese clinical nlp tasks |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9080187/ https://www.ncbi.nlm.nih.gov/pubmed/35527259 http://dx.doi.org/10.1186/s13326-022-00269-1 |
work_keys_str_mv | AT oliveiralucasemanuelsilvae semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT petersanacarolina semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT dasilvaadalnizamourapucca semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT gebelucacarolinepilatti semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT gumielyohanbonescki semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT cintholilianmiemukai semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT carvalhodeborahribeiro semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT alhasansadid semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks AT moroclaudiamariacabral semclinbramultiinstitutionalandmultispecialtysemanticallyannotatedcorpusforportugueseclinicalnlptasks |