Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets
MOTIVATION: Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is th...
Main Authors: | Liu, Yi; Elsworth, Benjamin L; Gaunt, Tom R |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Original Paper |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10097433/ https://www.ncbi.nlm.nih.gov/pubmed/37010521 http://dx.doi.org/10.1093/bioinformatics/btad169 |
_version_ | 1785024578497544192 |
---|---|
author | Liu, Yi Elsworth, Benjamin L Gaunt, Tom R |
author_sort | Liu, Yi |
collection | PubMed |
description | MOTIVATION: Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is therefore time-consuming and challenging. Recent developments in language modelling have created new methods for semantic representation of words and phrases, and these methods offer new opportunities to map human trait names in the form of words and short phrases, both to ontologies and to each other. Here, we present a comparison between a range of established and more recent language modelling approaches for the task of mapping trait names from UK Biobank to the Experimental Factor Ontology (EFO), and also explore how they compare to each other in direct trait-to-trait mapping. RESULTS: In our analyses of 1191 traits from UK Biobank with manual EFO mappings, the BioSentVec model performed best at predicting these, matching 40.3% of the manual mappings correctly. The BlueBERT-EFO model (finetuned on EFO) performed nearly as well (38.8% of traits matching the manual mapping). In contrast, Levenshtein edit distance only mapped 22% of traits correctly. Pairwise mapping of traits to each other demonstrated that many of the models can accurately group similar traits based on their semantic similarity. AVAILABILITY AND IMPLEMENTATION: Our code is available at https://github.com/MRCIEU/vectology. |
format | Online Article Text |
id | pubmed-10097433 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10097433 2023-04-13 Bioinformatics Original Paper Oxford University Press 2023-04-03 /pmc/articles/PMC10097433/ /pubmed/37010521 http://dx.doi.org/10.1093/bioinformatics/btad169 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets |
topic | Original Paper |
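
The description above names Levenshtein edit distance as the baseline for mapping trait names to ontology terms. The sketch below illustrates that kind of string-distance mapping only; it is not the authors' implementation (their code is at https://github.com/MRCIEU/vectology). The function names `levenshtein` and `nearest_term`, the lowercase normalisation, and the example labels are assumptions made for illustration.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def nearest_term(trait: str, terms: Iterable[str]) -> str:
    """Return the candidate label with the smallest edit distance to the trait name.

    Lowercasing both sides is an assumption of this sketch, not a step taken from the paper.
    """
    trait_normalised = trait.lower()
    return min(terms, key=lambda term: levenshtein(trait_normalised, term.lower()))


if __name__ == "__main__":
    # Illustrative labels only; these are not looked up from EFO.
    candidate_labels = ["body mass index", "body height", "asthma"]
    print(nearest_term("Body mass index (BMI)", candidate_labels))
```

A purely character-based distance like this has no notion of synonyms, which fits the description's report that the embedding-based models matched more of the manual mappings than Levenshtein distance did.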