ICARUS: Minimizing Human Effort in Iterative Data Completion
An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and...
Main authors: | Rahman, Protiva; Hebert, Courtney; Nandi, Arnab |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553872/ https://www.ncbi.nlm.nih.gov/pubmed/31179156 |
_version_ | 1783424890045464576 |
---|---|
author | Rahman, Protiva Hebert, Courtney Nandi, Arnab |
author_facet | Rahman, Protiva Hebert, Courtney Nandi, Arnab |
author_sort | Rahman, Protiva |
collection | PubMed |
description | An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and input from domain experts is required. A common method for experts to fill missing values is through rules. However, for large datasets with thousands of missing data points, it is laborious and time consuming for a user to make sense of the data and formulate effective completion rules. Thus, users need to be shown subsets of the data that will have the most impact in completing missing fields. Further, these subsets should provide the user with enough information to make an update. Choosing subsets that maximize the probability of filling in missing data from a large dataset is computationally expensive. To address these challenges, we present ICARUS, which uses a heuristic algorithm to show the user small subsets of the database in the form of a matrix. This allows the user to iteratively fill in data by applying suggested rules based on their direct edits to the matrix. The suggested rules amplify the users’ input to multiple missing fields by using the database schema to infer hierarchies. Simulations show ICARUS has an average improvement of 50% across three datasets over the baseline system. Further, in-person user studies demonstrate that naive users can fill in 68% of missing data within an hour, while manual rule specification spans weeks. |
format | Online Article Text |
id | pubmed-6553872 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
record_format | MEDLINE/PubMed |
spelling | pubmed-6553872 2019-06-06 ICARUS: Minimizing Human Effort in Iterative Data Completion Rahman, Protiva Hebert, Courtney Nandi, Arnab Proceedings VLDB Endowment Article 2018-09 /pmc/articles/PMC6553872/ /pubmed/31179156 Text en http://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. |
spellingShingle | Article Rahman, Protiva Hebert, Courtney Nandi, Arnab ICARUS: Minimizing Human Effort in Iterative Data Completion |
title | ICARUS: Minimizing Human Effort in Iterative Data Completion |
title_full | ICARUS: Minimizing Human Effort in Iterative Data Completion |
title_fullStr | ICARUS: Minimizing Human Effort in Iterative Data Completion |
title_full_unstemmed | ICARUS: Minimizing Human Effort in Iterative Data Completion |
title_short | ICARUS: Minimizing Human Effort in Iterative Data Completion |
title_sort | icarus: minimizing human effort in iterative data completion |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6553872/ https://www.ncbi.nlm.nih.gov/pubmed/31179156 |
work_keys_str_mv | AT rahmanprotiva icarusminimizinghumaneffortiniterativedatacompletion AT hebertcourtney icarusminimizinghumaneffortiniterativedatacompletion AT nandiarnab icarusminimizinghumaneffortiniterativedatacompletion |