A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain

Bibliographic Details
Main Authors: Cardoso, Carlota; Sousa, Rita T; Köhler, Sebastian; Pesquita, Catia
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects: Original Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7661097/
https://www.ncbi.nlm.nih.gov/pubmed/33181823
http://dx.doi.org/10.1093/database/baaa078
collection PubMed
description The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein–protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein–protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
format Online
Article
Text
id pubmed-7661097
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7661097 2020-11-18 A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain Cardoso, Carlota; Sousa, Rita T; Köhler, Sebastian; Pesquita, Catia Database (Oxford) Original Article Oxford University Press 2020-11-11 /pmc/articles/PMC7661097/ /pubmed/33181823 http://dx.doi.org/10.1093/database/baaa078 Text en © The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Original Article
work_keys_str_mv AT cardosocarlota acollectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT sousaritat acollectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT kohlersebastian acollectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT pesquitacatia acollectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT cardosocarlota collectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT sousaritat collectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT kohlersebastian collectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain
AT pesquitacatia collectionofbenchmarkdatasetsforknowledgegraphbasedsimilarityinthebiomedicaldomain