Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint

With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields in the past 20 years. In the process of medical research, data are constantly generated, and large amounts of real-world data form a “data disaster.” Effective data analysis a...

Bibliographic Details
Main Authors: Guo, Manping; Wang, Yiming; Yang, Qiaoning; Li, Rui; Zhao, Yang; Li, Chenfei; Zhu, Mingbo; Cui, Yao; Jiang, Xin; Sheng, Song; Li, Qingna; Gao, Rui
Format: Online Article Text
Language: English
Published: JMIR Publications 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557005/
https://www.ncbi.nlm.nih.gov/pubmed/37733421
http://dx.doi.org/10.2196/44310
author Guo, Manping
Wang, Yiming
Yang, Qiaoning
Li, Rui
Zhao, Yang
Li, Chenfei
Zhu, Mingbo
Cui, Yao
Jiang, Xin
Sheng, Song
Li, Qingna
Gao, Rui
author_sort Guo, Manping
collection PubMed
description With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields in the past 20 years. In the process of medical research, data are constantly generated, and large amounts of real-world data form a “data disaster.” Effective data analysis and mining are based on data availability and high data quality. The premise of high data quality is the need to clean the data. Data cleaning is the process of detecting and correcting “dirty data,” which is the basis of data analysis and management. Moreover, data cleaning is a common technology for improving data quality. However, the current literature on real-world research provides little guidance on how to efficiently and ethically set up and perform data cleaning. To address this issue, we proposed a data cleaning framework for real-world research, focusing on the 3 most common types of dirty data (duplicate, missing, and outlier data), and a normal workflow for data cleaning to serve as a reference for the application of such technologies in future studies. We also provided relevant suggestions for common problems in data cleaning.
format Online
Article
Text
id pubmed-10557005
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-10557005 2023-10-07
Interact J Med Res
Viewpoint
JMIR Publications 2023-09-21
/pmc/articles/PMC10557005/ /pubmed/37733421 http://dx.doi.org/10.2196/44310
Text en
©Manping Guo, Yiming Wang, Qiaoning Yang, Rui Li, Yang Zhao, Chenfei Li, Mingbo Zhu, Yao Cui, Xin Jiang, Song Sheng, Qingna Li, Rui Gao. Originally published in the Interactive Journal of Medical Research (https://www.i-jmr.org/), 21.09.2023.
https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Interactive Journal of Medical Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.i-jmr.org/, as well as this copyright and license information must be included.
title Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint
topic Viewpoint