ICARUS: Minimizing Human Effort in Iterative Data Completion

An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and...

Bibliographic Details
Main authors: Rahman, Protiva; Hebert, Courtney; Nandi, Arnab
Format: Online Article Text
Language: English
Published: 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553872/
https://www.ncbi.nlm.nih.gov/pubmed/31179156
author Rahman, Protiva
Hebert, Courtney
Nandi, Arnab
collection PubMed
description An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and input from domain experts is required. A common method for experts to fill missing values is through rules. However, for large datasets with thousands of missing data points, it is laborious and time consuming for a user to make sense of the data and formulate effective completion rules. Thus, users need to be shown subsets of the data that will have the most impact in completing missing fields. Further, these subsets should provide the user with enough information to make an update. Choosing subsets that maximize the probability of filling in missing data from a large dataset is computationally expensive. To address these challenges, we present ICARUS, which uses a heuristic algorithm to show the user small subsets of the database in the form of a matrix. This allows the user to iteratively fill in data by applying suggested rules based on their direct edits to the matrix. The suggested rules amplify the users’ input to multiple missing fields by using the database schema to infer hierarchies. Simulations show ICARUS has an average improvement of 50% across three datasets over the baseline system. Further, in-person user studies demonstrate that naive users can fill in 68% of missing data within an hour, while manual rule specification spans weeks.
format Online
Article
Text
id pubmed-6553872
institution National Center for Biotechnology Information
language English
publishDate 2018
record_format MEDLINE/PubMed
spelling pubmed-6553872 2019-06-06 ICARUS: Minimizing Human Effort in Iterative Data Completion Rahman, Protiva; Hebert, Courtney; Nandi, Arnab Proceedings VLDB Endowment Article 2018-09 /pmc/articles/PMC6553872/ /pubmed/31179156 Text en http://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
title ICARUS: Minimizing Human Effort in Iterative Data Completion
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553872/
https://www.ncbi.nlm.nih.gov/pubmed/31179156