
ICARUS: Minimizing Human Effort in Iterative Data Completion

An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and...


Bibliographic Details
Main Authors:	Rahman, Protiva; Hebert, Courtney; Nandi, Arnab
Format:	Online Article Text
Language:	English
Published:	2018
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553872/ https://www.ncbi.nlm.nih.gov/pubmed/31179156

_version_	1783424890045464576
author	Rahman, Protiva; Hebert, Courtney; Nandi, Arnab
author_sort	Rahman, Protiva
collection	PubMed
description	An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and input from domain experts is required. A common method for experts to fill missing values is through rules. However, for large datasets with thousands of missing data points, it is laborious and time consuming for a user to make sense of the data and formulate effective completion rules. Thus, users need to be shown subsets of the data that will have the most impact in completing missing fields. Further, these subsets should provide the user with enough information to make an update. Choosing subsets that maximize the probability of filling in missing data from a large dataset is computationally expensive. To address these challenges, we present ICARUS, which uses a heuristic algorithm to show the user small subsets of the database in the form of a matrix. This allows the user to iteratively fill in data by applying suggested rules based on their direct edits to the matrix. The suggested rules amplify the users’ input to multiple missing fields by using the database schema to infer hierarchies. Simulations show ICARUS has an average improvement of 50% across three datasets over the baseline system. Further, in-person user studies demonstrate that naive users can fill in 68% of missing data within an hour, while manual rule specification spans weeks.
format	Online Article Text
id	pubmed-6553872
institution	National Center for Biotechnology Information
language	English
publishDate	2018
record_format	MEDLINE/PubMed
spelling	pubmed-6553872 2019-06-06 ICARUS: Minimizing Human Effort in Iterative Data Completion Rahman, Protiva Hebert, Courtney Nandi, Arnab Proceedings VLDB Endowment Article 2018-09 /pmc/articles/PMC6553872/ /pubmed/31179156 Text en http://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
title	ICARUS: Minimizing Human Effort in Iterative Data Completion
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553872/ https://www.ncbi.nlm.nih.gov/pubmed/31179156
