Ontology-driven and weakly supervised rare disease identification from clinical notes

BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to be identified due to few cases available for machine learning and the need for data annotation from domain experts. METHODS: We p...

Bibliographic Details
Main authors: Dong, Hang, Suárez-Paniagua, Víctor, Zhang, Huayu, Wang, Minhong, Casey, Arlene, Davidson, Emma, Chen, Jiaoyan, Alex, Beatrice, Whiteley, William, Wu, Honghan
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10162001/
https://www.ncbi.nlm.nih.gov/pubmed/37147628
http://dx.doi.org/10.1186/s12911-023-02181-9
author Dong, Hang
Suárez-Paniagua, Víctor
Zhang, Huayu
Wang, Minhong
Casey, Arlene
Davidson, Emma
Chen, Jiaoyan
Alex, Beatrice
Whiteley, William
Wu, Honghan
author_sort Dong, Hang
collection PubMed
description BACKGROUND: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. METHODS: We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three annotated clinical datasets from two institutions in the US and the UK: MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports. RESULTS: The improvements in precision were pronounced (over 30% to 50% in absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline for processing clinical notes can extract rare disease cases that are mostly uncaptured in structured data (manually assigned ICD codes). CONCLUSION: The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline to clinical notes. The proposed weakly supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-023-02181-9.
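The two-step framework summarised in the abstract (Text-to-UMLS, then UMLS-to-ORDO) can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `Mention` type, the placeholder concept IDs, the `UMLS_TO_ORDO` table, and the length-based rule are hypothetical, and the simple rule stands in for the learned phenotype confirmation model, which the abstract does not specify in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str  # surface form found in the clinical note
    cui: str   # candidate UMLS concept ID proposed by an NER+L tool


# Hypothetical UMLS-to-ORDO lookup with placeholder IDs (not real identifiers).
UMLS_TO_ORDO = {"CUI_A": "ORDO_1", "CUI_B": "ORDO_2"}


def weak_rule(mention: Mention, min_len: int = 4) -> bool:
    """Illustrative weak rule: treat very short mentions (often ambiguous
    abbreviations) as unconfirmed. A trained confirmation model would
    replace this check in a real pipeline."""
    return len(mention.text) >= min_len


def identify_rare_diseases(mentions: list[Mention]) -> list[str]:
    # Step (i), Text-to-UMLS: keep only the candidate links that are confirmed.
    confirmed = [m for m in mentions if weak_rule(m)]
    # Step (ii), UMLS-to-ORDO: map confirmed concepts to rare disease IDs,
    # dropping concepts with no ORDO counterpart.
    return sorted({UMLS_TO_ORDO[m.cui] for m in confirmed if m.cui in UMLS_TO_ORDO})


if __name__ == "__main__":
    candidates = [
        Mention("HD", "CUI_A"),                # too short, filtered in step (i)
        Mention("example syndrome", "CUI_B"),  # confirmed and mapped
        Mention("common finding", "CUI_C"),    # confirmed, but not a rare disease
    ]
    print(identify_rare_diseases(candidates))
```

Filtering before mapping mirrors the abstract's emphasis on precision: false-positive links are removed at the Text-to-UMLS step, so they never reach the ontology match.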
format Online
Article
Text
id pubmed-10162001
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10162001 2023-05-07 BMC Med Inform Decis Mak Research BioMed Central 2023-05-05 /pmc/articles/PMC10162001/ /pubmed/37147628 http://dx.doi.org/10.1186/s12911-023-02181-9 Text en © The Author(s) 2023, corrected publication 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Ontology-driven and weakly supervised rare disease identification from clinical notes
topic Research