FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys f...

Bibliographic Details
Main authors: Buranosky, Matt, Stellnberger, Elmar, Pfaff, Emily, Diaz-Sanchez, David, Ward-Caviness, Cavin
Format: Online Article Text
Language: English
Published: F1000 Research Limited 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6489977/
https://www.ncbi.nlm.nih.gov/pubmed/31069050
http://dx.doi.org/10.12688/f1000research.16483.2
_version_ 1783414873297780736
author Buranosky, Matt
Stellnberger, Elmar
Pfaff, Emily
Diaz-Sanchez, David
Ward-Caviness, Cavin
collection PubMed
description Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. Previous research establishes that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This puts it in a special position to rule mine clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features added to improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 13 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.
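To make the terms in the description concrete, here is a minimal Python sketch of what it means for a functional dependency to hold in a table and how minimal FDs and candidate keys could be enumerated by brute force. This is not FD_Mine and not FDTool's code: the function names (`fd_holds`, `minimal_fds`, `candidate_keys`) and the sample rows are hypothetical, there is no equivalence pruning, and the key search assumes the table contains no duplicate rows.

```python
from itertools import combinations

# Hypothetical sample table: a list of dicts, one per row.
rows = [
    {"id": 1, "zip": "27514", "city": "Chapel Hill"},
    {"id": 2, "zip": "27514", "city": "Chapel Hill"},
    {"id": 3, "zip": "27701", "city": "Durham"},
    {"id": 4, "zip": "27703", "city": "Durham"},
]
attrs = ("id", "zip", "city")


def fd_holds(rows, lhs, rhs):
    """Return True if the attributes in `lhs` functionally determine `rhs`,
    i.e. rows that agree on every `lhs` attribute also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Remember the first rhs value seen for this lhs value; any later
        # row with the same lhs value but a different rhs value breaks the FD.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, attrs):
    """Naive level-wise search: try left-hand sides of increasing size and
    keep lhs -> rhs only if no subset of lhs already determines rhs."""
    found = []
    for size in range(1, len(attrs)):
        for lhs in combinations(attrs, size):
            for rhs in attrs:
                if rhs in lhs:
                    continue  # trivial dependency
                if any(r == rhs and set(l) <= set(lhs) for l, r in found):
                    continue  # a smaller lhs already determines rhs
                if fd_holds(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found


def candidate_keys(rows, attrs):
    """Minimal attribute sets whose values are unique across all rows
    (assumes the table has no duplicate rows)."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a superset of a key is not minimal
            if len({tuple(row[a] for a in combo) for row in rows}) == len(rows):
                keys.append(combo)
    return keys
```

On the sample rows, `minimal_fds(rows, attrs)` returns `id -> zip`, `id -> city`, and `zip -> city`, and `candidate_keys(rows, attrs)` returns only `("id",)`. The search visits every attribute combination, so its cost grows exponentially with the number of columns, which matches the description's remark that attribute count affects runtime far more than row count.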
format Online
Article
Text
id pubmed-6489977
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher F1000 Research Limited
record_format MEDLINE/PubMed
spelling pubmed-6489977 2019-05-07 FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data F1000Res Software Tool Article F1000 Research Limited 2019-06-19 /pmc/articles/PMC6489977/ /pubmed/31069050 http://dx.doi.org/10.12688/f1000research.16483.2 Text en Copyright: © 2019 Buranosky M et al.
http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data
topic Software Tool Article