HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey

BACKGROUND: Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ont...

Bibliographic Details
Main authors: Lastra-Díaz, Juan J., Lara-Clares, Alicia, Garcia-Serrano, Ana
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8734250/
https://www.ncbi.nlm.nih.gov/pubmed/34991460
http://dx.doi.org/10.1186/s12859-021-04539-0
author Lastra-Díaz, Juan J.
Lara-Clares, Alicia
Garcia-Serrano, Ana
collection PubMed
description BACKGROUND: Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are extensively used in many applications in biomedical text mining and genomics, respectively, which has encouraged the development of semantic measures libraries based on these ontologies. However, current state-of-the-art semantic measures libraries have performance and scalability drawbacks derived from their ontology representations, which are based on relational databases or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure, which integrates a shortest-path computation, sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure. RESULTS: To bridge these two gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for approximating Dijkstra’s algorithm on taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure. CONCLUSIONS: We introduce a set of reproducible benchmarks showing that HESML outperforms the current state-of-the-art libraries by several orders of magnitude on the three aforementioned biomedical ontologies, and demonstrating the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL scales linearly with the dimension of the common ancestor subgraph regardless of the ontology size. Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04539-0.
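The abstract names AncSPL but does not spell out its steps. The sketch below is not taken from HESML. It illustrates one plausible reading of "ancestors-based" shortest-path approximation: run Dijkstra's algorithm only on the subgraph induced by the two concepts and their ancestors, so the search cost depends on that subgraph rather than on the whole ontology. The function names, the `toy_parents` taxonomy, the unit edge weights, and the `1 / (1 + length)` similarity are all assumptions made for illustration.

```python
import heapq


def ancestors(parents, concept):
    """Return the concept together with all of its ancestors."""
    seen = {concept}
    stack = [concept]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ancestor_restricted_path_length(parents, source, target,
                                    weight=lambda child, parent: 1.0):
    """Dijkstra restricted to the ancestors of source and target.

    Hypothetical sketch: paths through descendants are ignored, so the
    result is an upper bound on the exact shortest-path length.
    """
    allowed = ancestors(parents, source) | ancestors(parents, target)

    # Undirected adjacency over is-a edges inside the ancestor subgraph.
    # Every parent of an allowed node is itself allowed (ancestor closure).
    adjacency = {node: [] for node in allowed}
    for child in allowed:
        for parent in parents.get(child, ()):
            w = weight(child, parent)
            adjacency[child].append((parent, w))
            adjacency[parent].append((child, w))

    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, w in adjacency[node]:
            nd = d + w
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return float("inf")


def path_similarity(parents, a, b):
    """Illustrative path-based similarity: 1 / (1 + path length)."""
    return 1.0 / (1.0 + ancestor_restricted_path_length(parents, a, b))


# Toy taxonomy (child -> parents). "c" is a common child of "a" and "b".
toy_parents = {
    "root": [],
    "x": ["root"],
    "y": ["root"],
    "a": ["x"],
    "b": ["y"],
    "c": ["a", "b"],
}

if __name__ == "__main__":
    print(ancestor_restricted_path_length(toy_parents, "a", "b"))  # 4.0
    print(path_similarity(toy_parents, "a", "b"))                  # 0.2
```

In the toy taxonomy the exact shortest path between `a` and `b` has length 2 (through their common child `c`), while the ancestor-restricted search returns 4 (through `root`). This shows why a method of this kind is an approximation rather than an exact computation.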
format Online
Article
Text
id pubmed-8734250
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8734250 2022-01-07 HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey Lastra-Díaz, Juan J. Lara-Clares, Alicia Garcia-Serrano, Ana BMC Bioinformatics Software BioMed Central 2022-01-06 /pmc/articles/PMC8734250/ /pubmed/34991460 http://dx.doi.org/10.1186/s12859-021-04539-0 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8734250/
https://www.ncbi.nlm.nih.gov/pubmed/34991460
http://dx.doi.org/10.1186/s12859-021-04539-0