A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a Pig Latin based script

BACKGROUND: Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited...

Bibliographic Details
Main Authors: Horiguchi, Hiromasa; Yasunaga, Hideo; Hashimoto, Hideki; Ohe, Kazuhiko
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3545829/
https://www.ncbi.nlm.nih.gov/pubmed/23259862
http://dx.doi.org/10.1186/1472-6947-12-151
author Horiguchi, Hiromasa
Yasunaga, Hideo
Hashimoto, Hideki
Ohe, Kazuhiko
collection PubMed
description BACKGROUND: Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large scale data into wide table format, where each subject is represented by one row, for use in health services and clinical research. Since the original specification of Pig provides very few functions for column field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to deal with processing that considers date conditions. RESULTS: Having prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, we used the Elastic Compute Cloud environment provided by Amazon Inc. to execute processing speed and scaling benchmarks. In the speed benchmark test, the response time was significantly reduced and a linear relationship was observed between the quantity of data and processing time in both a small and a very large dataset. The scaling benchmark test showed clear scalability. In our system, doubling the number of nodes resulted in a 47% decrease in processing time. CONCLUSIONS: Our newly developed system is widely accessible as an open resource. This system is very simple and easy to use for researchers who are accustomed to using declarative command syntax for commercial statistical software and Structured Query Language. 
Although our system needs further sophistication to allow more flexibility in scripts and to improve efficiency in data processing, it shows promise in facilitating the application of MapReduce technology to efficient data processing with large scale administrative data in health services and clinical research.
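The wide table format mentioned in the description, where each subject is represented by one row, can be illustrated with a minimal sketch. This is plain Python, not the authors' GroupFilterFormat system and not Pig Latin; the function name `to_wide`, the column names, and the sample records are hypothetical and chosen only for illustration. The sketch groups event rows by subject and then emits one fixed-width row per subject, which loosely mirrors a group step followed by a format step.

```python
from collections import defaultdict


def to_wide(events, fields):
    """Pivot long-format (subject_id, field, value) rows into one row per subject.

    Hypothetical helper for illustration; not part of GroupFilterFormat.
    """
    by_subject = defaultdict(dict)
    for subject_id, field, value in events:
        # Sum repeated events recorded for the same subject and field.
        by_subject[subject_id][field] = by_subject[subject_id].get(field, 0) + value
    # Emit one row per subject in a fixed column order; absent fields become 0.
    return [
        {"patient_id": sid, **{f: cols.get(f, 0) for f in fields}}
        for sid, cols in sorted(by_subject.items())
    ]


# Made-up activity log rows: (patient, activity, count).
events = [
    ("P001", "ct_scan", 1),
    ("P002", "mri", 1),
    ("P001", "ct_scan", 1),
    ("P001", "blood_test", 3),
]

wide = to_wide(events, ["blood_test", "ct_scan", "mri"])
for row in wide:
    print(row)
```

Here the four log rows collapse into two rows, one per patient, with one column per activity. At the scale described in the record (hundreds of millions of events) the same grouping would be distributed across nodes rather than held in a single in-memory dictionary.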
format Online
Article
Text
id pubmed-3545829
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3545829 2013-01-17 Horiguchi, Hiromasa Yasunaga, Hideo Hashimoto, Hideki Ohe, Kazuhiko BMC Med Inform Decis Mak Software BioMed Central 2012-12-22 /pmc/articles/PMC3545829/ /pubmed/23259862 http://dx.doi.org/10.1186/1472-6947-12-151 Text en Copyright ©2012 Horiguchi et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a Pig Latin based script
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3545829/
https://www.ncbi.nlm.nih.gov/pubmed/23259862
http://dx.doi.org/10.1186/1472-6947-12-151