A data value metric for quantifying information content and utility

Bibliographic Details
Main Authors: Noshad, Morteza, Choi, Jerome, Sun, Yuming, Hero, Alfred, Dinov, Ivo D.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8550565/
https://www.ncbi.nlm.nih.gov/pubmed/34777945
http://dx.doi.org/10.1186/s40537-021-00446-6
_version_ 1784590982095831040
author Noshad, Morteza
Choi, Jerome
Sun, Yuming
Hero, Alfred
Dinov, Ivo D.
author_facet Noshad, Morteza
Choi, Jerome
Sun, Yuming
Hero, Alfred
Dinov, Ivo D.
author_sort Noshad, Morteza
collection PubMed
description Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions in manufacturing costs, and significant demand for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous and multisource data; however, not all data are of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called the Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) against model complexity. DVM can be used to determine whether appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choice of data analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity term captures the usefulness of the sample data in the context of the inferential task. The regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of the information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model. We tested the DVM method on several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets with weak and strong signal information were used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the trade-offs between algorithmic complexity and data analytical value in terms of the sample size and the feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data needed to optimize the relative utility of various supervised or unsupervised algorithms.
format Online
Article
Text
id pubmed-8550565
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-8550565 2021-11-10 A data value metric for quantifying information content and utility Noshad, Morteza; Choi, Jerome; Sun, Yuming; Hero, Alfred; Dinov, Ivo D. J Big Data Research Springer International Publishing 2021-06-05 2021 /pmc/articles/PMC8550565/ /pubmed/34777945 http://dx.doi.org/10.1186/s40537-021-00446-6 Text en © The Author(s) 2021. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Research
Noshad, Morteza
Choi, Jerome
Sun, Yuming
Hero, Alfred
Dinov, Ivo D.
A data value metric for quantifying information content and utility
title A data value metric for quantifying information content and utility
title_full A data value metric for quantifying information content and utility
title_fullStr A data value metric for quantifying information content and utility
title_full_unstemmed A data value metric for quantifying information content and utility
title_short A data value metric for quantifying information content and utility
title_sort data value metric for quantifying information content and utility
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8550565/
https://www.ncbi.nlm.nih.gov/pubmed/34777945
http://dx.doi.org/10.1186/s40537-021-00446-6
work_keys_str_mv AT noshadmorteza adatavaluemetricforquantifyinginformationcontentandutility
AT choijerome adatavaluemetricforquantifyinginformationcontentandutility
AT sunyuming adatavaluemetricforquantifyinginformationcontentandutility
AT heroalfred adatavaluemetricforquantifyinginformationcontentandutility
AT dinovivod adatavaluemetricforquantifyinginformationcontentandutility
AT noshadmorteza datavaluemetricforquantifyinginformationcontentandutility
AT choijerome datavaluemetricforquantifyinginformationcontentandutility
AT sunyuming datavaluemetricforquantifyinginformationcontentandutility
AT heroalfred datavaluemetricforquantifyinginformationcontentandutility
AT dinovivod datavaluemetricforquantifyinginformationcontentandutility
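
The abstract in the description field above characterizes the DVM as a regularized mixture of a fidelity term (how useful the sample data are for the inferential task) and a regularization term (the computational complexity of the inferential method), and notes that the metric tracks whether growing a dataset adds or erodes analytical value. The snippet below is a minimal illustrative sketch of that idea only, not the authors' implementation: it assumes a simple additive form, fidelity minus a weighted complexity penalty, with cross-validated accuracy standing in for fidelity and wall-clock training time standing in for complexity; the `dvm_score` helper, the `lam` weight, and the log scaling of the time term are all assumptions made for illustration.

```python
# Illustrative sketch only: the paper defines the DVM as a regularized balance of
# a fidelity (task-utility) term and a complexity term. The additive form, the
# lam weight, and the proxies chosen below are assumptions, not the authors' method.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def dvm_score(X, y, model, lam=0.1):
    """Toy DVM: cross-validated accuracy (fidelity) minus a scaled training-time penalty (complexity)."""
    fidelity = cross_val_score(model, X, y, cv=5).mean()   # task-performance proxy
    start = time.perf_counter()
    model.fit(X, y)                                        # complexity proxy: wall-clock fit time
    elapsed = time.perf_counter() - start
    return fidelity - lam * np.log1p(elapsed)


# Track how the toy DVM changes as the sample size grows.
X_full, y_full = make_classification(
    n_samples=5000, n_features=20, n_informative=5, random_state=0
)
for n in (200, 1000, 5000):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    print(f"n={n:5d}  toy DVM={dvm_score(X_full[:n], y_full[:n], model):.3f}")
```

Under this toy form, a DVM curve that flattens or declines as the sample size grows would indicate that additional data contributes more computational cost than analytical value, which is the kind of trade-off the paper's metric is designed to expose.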