DIVIS: a semantic DIstance to improve the VISualisation of heterogeneous phenotypic datasets

BACKGROUND: Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large amounts of datasets which often mix quantitative and qualitative variables and are not always complete, in particular when they regard phenotypic traits. In order to get a first insig...

Bibliographic Details
Main Authors:	Eid, Rayan; Landès, Claudine; Pernet, Alix; Benoît, Emmanuel; Santagostini, Pierre; Ghaziri, Angelina El; Bourbeillon, Julie
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8981856/ https://www.ncbi.nlm.nih.gov/pubmed/35379292 http://dx.doi.org/10.1186/s13040-022-00293-y

_version_	1784681688034443264
author	Eid, Rayan Landès, Claudine Pernet, Alix Benoît, Emmanuel Santagostini, Pierre Ghaziri, Angelina El Bourbeillon, Julie
author_sort	Eid, Rayan
collection	PubMed
description	BACKGROUND: Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large numbers of datasets, which often mix quantitative and qualitative variables and are not always complete, particularly when they concern phenotypic traits. To get a first insight into these datasets and reduce the size of the data matrices, scientists often rely on multivariate analysis techniques. However, such approaches are not always easy to apply, particularly to mixed datasets. Moreover, displaying large numbers of individuals leads to cluttered visualisations which are difficult to interpret. RESULTS: We introduced a new methodology to overcome these limitations. Its main feature is a new semantic distance tailored for both quantitative and qualitative variables, which allows for a realistic representation of the relationships between individuals (phenotypic descriptions in our case). This semantic distance is based on ontologies engineered to represent real-life knowledge regarding the underlying variables. For easier handling by biologists, we incorporated its use into a complete tool, from raw data file to visualisation. Following the distance calculation, the next steps performed by the tool consist of (i) grouping similar individuals, (ii) representing each group by emblematic individuals, which we call archetypes, and (iii) building sparse visualisations based on these archetypes. Our approach was implemented as a Python pipeline and applied to a rosebush dataset including passport and phenotypic data. CONCLUSIONS: The introduction of our new semantic distance and of the archetype concept allowed us to build a comprehensive representation of an incomplete dataset characterised by a large proportion of qualitative data. The methodology described here could have wider use beyond information characterizing organisms or species and beyond plant science. Indeed, the same approach could be applied to any mixed dataset.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s13040-022-00293-y).
format	Online Article Text
id	pubmed-8981856
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8981856 2022-04-06 DIVIS: a semantic DIstance to improve the VISualisation of heterogeneous phenotypic datasets Eid, Rayan Landès, Claudine Pernet, Alix Benoît, Emmanuel Santagostini, Pierre Ghaziri, Angelina El Bourbeillon, Julie BioData Min Research BioMed Central 2022-04-04 /pmc/articles/PMC8981856/ /pubmed/35379292 http://dx.doi.org/10.1186/s13040-022-00293-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	DIVIS: a semantic DIstance to improve the VISualisation of heterogeneous phenotypic datasets
title_sort	divis: a semantic distance to improve the visualisation of heterogeneous phenotypic datasets
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8981856/ https://www.ncbi.nlm.nih.gov/pubmed/35379292 http://dx.doi.org/10.1186/s13040-022-00293-y
