
Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis

BACKGROUND: As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts were de...


Bibliographic Details
Main Authors: Zheng, Fengbo, Abeysinghe, Rashmie, Cui, Licong
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8579614/
https://www.ncbi.nlm.nih.gov/pubmed/34753458
http://dx.doi.org/10.1186/s12911-021-01592-w
author Zheng, Fengbo
Abeysinghe, Rashmie
Cui, Licong
collection PubMed
description BACKGROUND: Biomedical knowledge evolves rapidly, so concept enrichment of biomedical terminologies, which involves automatically identifying missing or new concepts, is an active research area. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach, in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequence-based FCA approach that identifies potentially missing concepts and supports concept naming and positioning.
METHODS: We treat concept name sequences as FCA attributes to construct the formal context. Concepts are formed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we predict their potential positions in the original hierarchy by identifying their supertypes and subtypes among the original concepts. We evaluate the approach by automated validation against external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed.
RESULTS: We applied our sequence-based FCA approach to all sub-hierarchies under Disease or Disorder in the NCI Thesaurus (version 19.08d) and to five sub-hierarchies under Clinical Finding and Procedure in SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in SNOMED CT. For the NCI Thesaurus, 85 potentially missing concepts were found in external terminologies, and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies, and 1159 of the remaining 6647 appeared in biomedical literature.
CONCLUSION: Our sequence-based FCA approach shows promise for identifying potentially missing concepts in biomedical terminologies.
format Online
Article
Text
id pubmed-8579614
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8579614 2021-11-10 Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis Zheng, Fengbo Abeysinghe, Rashmie Cui, Licong BMC Med Inform Decis Mak Research BioMed Central 2021-11-09 /pmc/articles/PMC8579614/ /pubmed/34753458 http://dx.doi.org/10.1186/s12911-021-01592-w Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis
topic Research
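
The concept-forming step named in the METHODS part of the abstract (computing the longest common substrings of concept name sequences) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes concept names are lowercased and split into word tokens, compares names only pairwise rather than building a full formal context, and uses a hypothetical `min_tokens` threshold and hypothetical example names.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Return the longest contiguous run of tokens shared by lists a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def candidate_concepts(names, min_tokens=2):
    """Collect shared name fragments that are not already concept names.

    Assumption: word-level tokens and pairwise comparison only; min_tokens
    is an illustrative filter, not a parameter taken from the paper.
    """
    existing = {name.lower() for name in names}
    tokenized = [name.lower().split() for name in names]
    candidates = set()
    for a, b in combinations(tokenized, 2):
        common = longest_common_substring(a, b)
        if len(common) >= min_tokens:
            fragment = " ".join(common)
            if fragment not in existing:
                candidates.add(fragment)
    return candidates


# Hypothetical concept names, chosen only to illustrate the idea.
names = [
    "Recurrent Skin Squamous Cell Carcinoma",
    "Stage I Skin Squamous Cell Carcinoma",
    "Squamous Cell Carcinoma",
]
print(candidate_concepts(names))
```

Because the shared fragment is a contiguous word sequence, it can serve directly as a proposed concept name, which a bag-of-words intersection cannot guarantee. Positioning the candidate in the hierarchy and validating it against UMLS or PubMed are outside the scope of this sketch.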