Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach

BACKGROUND: Data requirements by governments, donors and the international community to measure health and development achievements have increased in the last decade. Datasets produced in surveys conducted in several countries and years are often combined to analyse time trends and geographical patt...

Bibliographic Details
Main Author:	Bosch-Capblanch, Xavier
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3123542/ https://www.ncbi.nlm.nih.gov/pubmed/21595905 http://dx.doi.org/10.1186/1472-6947-11-33

author	Bosch-Capblanch, Xavier
collection	PubMed
description	BACKGROUND: Data requirements by governments, donors and the international community to measure health and development achievements have increased in the last decade. Datasets produced in surveys conducted in several countries and years are often combined to analyse time trends and geographical patterns of demographic and health-related indicators. However, since not all datasets have the same structure, variable definitions and codes, they have to be harmonised before being submitted to statistical analysis. Manually searching for, renaming and recoding variables is extremely tedious and error-prone, especially when the number of datasets and variables is large. This article presents an automated approach to harmonising variable names across several datasets, which optimises the search for variables, minimises manual input and reduces the risk of error. RESULTS: Three consecutive algorithms are applied iteratively to search for each variable of interest for the analyses in all datasets. The first search (A) captures particular cases that could not be solved in an automated way in the search iterations; the second search (B) is run if search A produced no hits and identifies variables whose labels contain certain key terms defined by the user. If this search produces no hits, a third one (C) is run to retrieve variables which have been identified in other surveys, as an illustration. For each variable of interest, the outputs of these engines can be: (O1) a single best matching variable is found, (O2) more than one matching variable is found or (O3) no matching variables are found. Output O2 is resolved by user judgement. Examples using four variables are presented showing that the searches have 100% sensitivity and specificity after a second iteration. CONCLUSION: Efficient and tested automated algorithms should be used to support the harmonisation process needed to analyse multiple datasets. This is especially relevant when the numbers of datasets or variables to be included are large.
format	Online Article Text
id	pubmed-3123542
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3123542 2011-06-26 Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach Bosch-Capblanch, Xavier BMC Med Inform Decis Mak Software BioMed Central 2011-05-19 /pmc/articles/PMC3123542/ /pubmed/21595905 http://dx.doi.org/10.1186/1472-6947-11-33 Text en Copyright ©2011 Bosch-Capblanch; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach
topic	Software
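
The abstract in this record describes the three-stage search (A, then B, then C) and its three possible outputs (O1, O2, O3) only in outline. The following is a minimal Python sketch of that cascade under stated assumptions, not the author's implementation: every function name, the modelling of a dataset as a mapping from variable name to label, the rule that all key terms must appear in a label, and the sample variable names are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Sequence, Tuple

# Assumption: a dataset is modelled as {variable_name: variable_label}.
Dataset = Dict[str, str]


class Outcome(Enum):
    O1 = "single best matching variable"
    O2 = "more than one match, resolved by user judgement"
    O3 = "no matching variable"


def search_a(dataset_id: str, dataset: Dataset,
             special_cases: Dict[str, str]) -> List[str]:
    """Search A: particular cases recorded by hand for a given dataset."""
    name = special_cases.get(dataset_id)
    return [name] if name is not None and name in dataset else []


def search_b(dataset: Dataset, key_terms: Sequence[str]) -> List[str]:
    """Search B: variables whose label contains the user-defined key terms.

    Assumption: every key term must occur (case-insensitive substring).
    """
    terms = [t.lower() for t in key_terms]
    if not terms:
        return []
    return [name for name, label in dataset.items()
            if all(t in label.lower() for t in terms)]


def search_c(dataset: Dataset, known_names: Sequence[str]) -> List[str]:
    """Search C: variable names already identified in other surveys."""
    return [name for name in known_names if name in dataset]


def find_variable(dataset_id: str, dataset: Dataset,
                  special_cases: Dict[str, str],
                  key_terms: Sequence[str],
                  known_names: Sequence[str]) -> Tuple[Outcome, List[str]]:
    """Run A, then B if A has no hits, then C if B has no hits."""
    hits = search_a(dataset_id, dataset, special_cases)
    if not hits:
        hits = search_b(dataset, key_terms)
    if not hits:
        hits = search_c(dataset, known_names)
    if len(hits) == 1:
        return Outcome.O1, hits
    if hits:
        return Outcome.O2, hits
    return Outcome.O3, hits


if __name__ == "__main__":
    # Hypothetical survey with two candidate "age" variables.
    survey = {"v012": "Current age of respondent",
              "v013": "Age in 5-year groups"}
    # First iteration: search B finds two candidates (O2).
    print(find_variable("s1", survey, {}, ["age"], []))
    # The user picks one; recording it as a special case lets search A
    # resolve the variable in the second iteration (O1).
    print(find_variable("s1", survey, {"s1": "v012"}, ["age"], []))
```

In this sketch, the iteration mentioned in the abstract corresponds to re-running `find_variable` after the user has recorded a decision for an O2 result as a special case, so that search A resolves it on the next pass.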

Similar Items