Automatic identification of variables in epidemiological datasets using logic regression

BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed in a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated ide...

Bibliographic Details
Main authors:	Lorenz, Matthias W., Abdi, Negin Ashtiani, Scheckenbach, Frank, Pflug, Anja, Bülbül, Alpaslan, Catapano, Alberico L., Agewall, Stefan, Ezhov, Marat, Bots, Michiel L., Kiechl, Stefan, Orth, Andreas
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390441/ https://www.ncbi.nlm.nih.gov/pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1

author	Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas
author_sort	Lorenz, Matthias W.
collection	PubMed
description	BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. METHODS: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. RESULTS: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less in 63% of all variables in the construction sample and in 71% of all variables in the validation sample. CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-017-0429-1) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5390441
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5390441 2017-04-14 BMC Med Inform Decis Mak Research Article BioMed Central 2017-04-13 /pmc/articles/PMC5390441/ /pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Automatic identification of variables in epidemiological datasets using logic regression
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390441/ https://www.ncbi.nlm.nih.gov/pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1
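
The METHODS summary in the description field can be illustrated with a small sketch: simple rules applied to source variable names, one Boolean combination of those rules for a single target variable, and PPV/NPV computed against known matches. Everything here is an assumption made for illustration (the rule names, the example variable names, the target "systolic blood pressure", and the hand-picked combination); none of it is taken from the article. In particular, the combination is fixed by hand, whereas logic regression would search for it automatically.

```python
# Hypothetical hand-written rules: each maps a source variable name to True/False.
RULES = {
    "has_sbp": lambda name: "sbp" in name.lower(),
    "has_sys": lambda name: "sys" in name.lower(),
    "has_dia": lambda name: "dia" in name.lower(),
}

# Made-up labelled examples: (source variable name, is it the target variable?)
# The assumed target variable is systolic blood pressure.
EXAMPLES = [
    ("SBP_baseline", True),
    ("sys_bp", True),
    ("sysdia_ratio", False),
    ("dbp_dia", False),
    ("system_id", False),
    ("age", False),
]


def combined_rule(name):
    """One hand-picked Boolean combination: has_sbp OR (has_sys AND NOT has_dia)."""
    r = {key: rule(name) for key, rule in RULES.items()}
    return r["has_sbp"] or (r["has_sys"] and not r["has_dia"])


def ppv_npv(predictions, truths):
    """Return (PPV, NPV); a value is None when its denominator is zero."""
    pairs = list(zip(predictions, truths))
    tp = sum(p and t for p, t in pairs)          # predicted match, true match
    fp = sum(p and not t for p, t in pairs)      # predicted match, not a match
    tn = sum(not p and not t for p, t in pairs)  # predicted non-match, correct
    fn = sum(not p and t for p, t in pairs)      # missed a true match
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv


if __name__ == "__main__":
    predictions = [combined_rule(name) for name, _ in EXAMPLES]
    truths = [truth for _, truth in EXAMPLES]
    ppv, npv = ppv_npv(predictions, truths)
    print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

In this toy data the rule flags `system_id` as a false positive, which lowers PPV while NPV stays high because no true match is missed. That is the trade-off the abstract describes as acceptable for semi-automation, where a user picks the correct variable from a short candidate list.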
