Ontology-driven and weakly supervised rare disease identification from clinical notes
BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. METHODS: We p...
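Main authors: | Dong, Hang; Suárez-Paniagua, Víctor; Zhang, Huayu; Wang, Minhong; Casey, Arlene; Davidson, Emma; Chen, Jiaoyan; Alex, Beatrice; Whiteley, William; Wu, Honghan |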
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10162001/ https://www.ncbi.nlm.nih.gov/pubmed/37147628 http://dx.doi.org/10.1186/s12911-023-02181-9 |
_version_ | 1785037612395790336 |
---|---|
author | Dong, Hang Suárez-Paniagua, Víctor Zhang, Huayu Wang, Minhong Casey, Arlene Davidson, Emma Chen, Jiaoyan Alex, Beatrice Whiteley, William Wu, Honghan |
author_facet | Dong, Hang Suárez-Paniagua, Víctor Zhang, Huayu Wang, Minhong Casey, Arlene Davidson, Emma Chen, Jiaoyan Alex, Beatrice Whiteley, William Wu, Honghan |
author_sort | Dong, Hang |
collection | PubMed |
description | BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. METHODS: We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three annotated clinical datasets from two institutions in the US and the UK: MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports. RESULTS: The improvements in precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline for processing clinical notes can extract rare disease cases, most of which are not captured in structured data (manually assigned ICD codes). CONCLUSION: The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline to clinical notes. The proposed weakly supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-023-02181-9. |
format | Online Article Text |
id | pubmed-10162001 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10162001 2023-05-07 Ontology-driven and weakly supervised rare disease identification from clinical notes Dong, Hang Suárez-Paniagua, Víctor Zhang, Huayu Wang, Minhong Casey, Arlene Davidson, Emma Chen, Jiaoyan Alex, Beatrice Whiteley, William Wu, Honghan BMC Med Inform Decis Mak Research BioMed Central 2023-05-05 /pmc/articles/PMC10162001/ /pubmed/37147628 http://dx.doi.org/10.1186/s12911-023-02181-9 Text en © The Author(s) 2023, corrected publication 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Dong, Hang Suárez-Paniagua, Víctor Zhang, Huayu Wang, Minhong Casey, Arlene Davidson, Emma Chen, Jiaoyan Alex, Beatrice Whiteley, William Wu, Honghan Ontology-driven and weakly supervised rare disease identification from clinical notes |
title | Ontology-driven and weakly supervised rare disease identification from clinical notes |
title_full | Ontology-driven and weakly supervised rare disease identification from clinical notes |
title_fullStr | Ontology-driven and weakly supervised rare disease identification from clinical notes |
title_full_unstemmed | Ontology-driven and weakly supervised rare disease identification from clinical notes |
title_short | Ontology-driven and weakly supervised rare disease identification from clinical notes |
title_sort | ontology-driven and weakly supervised rare disease identification from clinical notes |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10162001/ https://www.ncbi.nlm.nih.gov/pubmed/37147628 http://dx.doi.org/10.1186/s12911-023-02181-9 |
work_keys_str_mv | AT donghang ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT suarezpaniaguavictor ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT zhanghuayu ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT wangminhong ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT caseyarlene ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT davidsonemma ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT chenjiaoyan ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT alexbeatrice ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT whiteleywilliam ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes AT wuhonghan ontologydrivenandweaklysupervisedrarediseaseidentificationfromclinicalnotes |
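
The two-step framework summarised in the abstract (Text-to-UMLS, then UMLS-to-ORDO) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Mention` class, the `weak_label` length rule, the `identify_rare_diseases` function, and the placeholder identifiers `CUI_A`, `CUI_B`, and `ORDO_X` are hypothetical and are not the authors' rules, the SemEHR API, or real UMLS or ORDO codes. In the study a model learned from weak labels confirms each mention; the sketch simply reuses the rule itself as a stand-in for that model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical UMLS-to-ORDO lookup. The keys and values are placeholders,
# not real UMLS CUIs or ORDO identifiers.
UMLS_TO_ORDO: Dict[str, str] = {"CUI_A": "ORDO_X"}


@dataclass
class Mention:
    text: str  # surface form found in the clinical note
    cui: str   # concept proposed by an NER+L tool (assumed given)


def weak_label(mention: Mention, min_chars: int = 4) -> bool:
    """Illustrative weak rule (an assumption, not the published rule):
    very short mentions are treated as likely ambiguous abbreviations."""
    return len(mention.text) >= min_chars


def identify_rare_diseases(
    mentions: List[Mention],
    confirm: Callable[[Mention], bool] = weak_label,
) -> List[str]:
    """Step (i): keep mentions that the confirmation function accepts.
    Step (ii): map the linked UMLS concept to a rare disease, if any."""
    found: List[str] = []
    for mention in mentions:
        if not confirm(mention):
            continue  # rejected as a probable false-positive link
        ordo_id = UMLS_TO_ORDO.get(mention.cui)
        if ordo_id is not None:
            found.append(ordo_id)
    return found


mentions = [
    Mention("example syndrome", "CUI_A"),  # confirmed and mapped
    Mention("ES", "CUI_A"),                # too short, rejected in step (i)
    Mention("headache", "CUI_B"),          # confirmed, but no ORDO match
]
result = identify_rare_diseases(mentions)
print(result)
```

Passing the confirmation step as a function keeps the sketch open to substitution: a trained classifier could replace `weak_label` without changing the mapping step.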