
Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets

MOTIVATION: Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is th...


Bibliographic Details
Main authors: Liu, Yi; Elsworth, Benjamin L; Gaunt, Tom R
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10097433/
https://www.ncbi.nlm.nih.gov/pubmed/37010521
http://dx.doi.org/10.1093/bioinformatics/btad169
author Liu, Yi
Elsworth, Benjamin L
Gaunt, Tom R
collection PubMed
description MOTIVATION: Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is therefore time-consuming and challenging. Recent developments in language modelling have created new methods for semantic representation of words and phrases, and these methods offer new opportunities to map human trait names in the form of words and short phrases, both to ontologies and to each other. Here, we present a comparison between a range of established and more recent language modelling approaches for the task of mapping trait names from UK Biobank to the Experimental Factor Ontology (EFO), and also explore how they compare to each other in direct trait-to-trait mapping. RESULTS: In our analyses of 1191 traits from UK Biobank with manual EFO mappings, the BioSentVec model performed best at predicting these, matching 40.3% of the manual mappings correctly. The BlueBERT-EFO model (finetuned on EFO) performed nearly as well (38.8% of traits matching the manual mapping). In contrast, Levenshtein edit distance only mapped 22% of traits correctly. Pairwise mapping of traits to each other demonstrated that many of the models can accurately group similar traits based on their semantic similarity. AVAILABILITY AND IMPLEMENTATION: Our code is available at https://github.com/MRCIEU/vectology.
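The abstract names Levenshtein edit distance as the baseline against which the language models were compared. The sketch below is not the authors' code (that is at the repository linked above); it is a minimal illustration, assuming a plain nearest-label lookup, of how a string-distance baseline could assign a trait name to the closest ontology label. The function names `levenshtein` and `map_trait` and the toy label list are hypothetical and are not taken from UK Biobank or EFO.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def map_trait(trait: str, labels: list[str]) -> tuple[str, int]:
    """Return the label closest to the trait (case-insensitive) and its distance."""
    scored = [(levenshtein(trait.lower(), label.lower()), label) for label in labels]
    distance, best = min(scored)
    return best, distance


# Hypothetical toy labels for illustration only.
toy_labels = ["body mass index", "asthma", "type 2 diabetes mellitus"]
print(map_trait("Body mass index (BMI)", toy_labels))
```

A lookup like this only rewards shared characters, so two names for the same trait that are spelled differently score poorly. That is consistent with the abstract's report that edit distance matched fewer manual mappings than the embedding-based models.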
format Online
Article
Text
id pubmed-10097433
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10097433 2023-04-13 Bioinformatics Original Paper Oxford University Press 2023-04-03 Text en © The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets
topic Original Paper