
Automatic identification of variables in epidemiological datasets using logic regression

BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated ide...


Bibliographic Details
Main authors: Lorenz, Matthias W., Abdi, Negin Ashtiani, Scheckenbach, Frank, Pflug, Anja, Bülbül, Alpaslan, Catapano, Alberico L., Agewall, Stefan, Ezhov, Marat, Bots, Michiel L., Kiechl, Stefan, Orth, Andreas
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390441/
https://www.ncbi.nlm.nih.gov/pubmed/28407816
http://dx.doi.org/10.1186/s12911-017-0429-1
_version_ 1782521461447065600
author Lorenz, Matthias W.
Abdi, Negin Ashtiani
Scheckenbach, Frank
Pflug, Anja
Bülbül, Alpaslan
Catapano, Alberico L.
Agewall, Stefan
Ezhov, Marat
Bots, Michiel L.
Kiechl, Stefan
Orth, Andreas
author_facet Lorenz, Matthias W.
Abdi, Negin Ashtiani
Scheckenbach, Frank
Pflug, Anja
Bülbül, Alpaslan
Catapano, Alberico L.
Agewall, Stefan
Ezhov, Marat
Bots, Michiel L.
Kiechl, Stefan
Orth, Andreas
author_sort Lorenz, Matthias W.
collection PubMed
description BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it makes it possible to build software that presents, for each target variable, a choice of source variables from which a user can pick the matching one, with only a low risk of having missed a correct source variable. METHODS: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. RESULTS: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-017-0429-1) contains supplementary material, which is available to authorized users.
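The METHODS part of the description can be illustrated with a small sketch. It is not the authors' implementation: logic regression normally searches over logic trees with a stochastic procedure, whereas the code below simply enumerates single rules and pairwise AND/OR combinations. The toy source variables, the three rules, and the target variable "age" are all hypothetical and chosen only to show how a Boolean combination of hand-written rules is selected and then scored with PPV and NPV.

```python
from itertools import combinations

# Hypothetical source variables: name, observed value range, and whether the
# variable truly corresponds to the (hypothetical) target variable "age".
VARIABLES = [
    {"name": "age",      "min": 35, "max": 89,  "is_target": True},
    {"name": "AGE_BL",   "min": 40, "max": 92,  "is_target": True},
    {"name": "agegroup", "min": 1,  "max": 5,   "is_target": False},
    {"name": "sbp",      "min": 90, "max": 210, "is_target": False},
    {"name": "weight",   "min": 45, "max": 105, "is_target": False},
    {"name": "sex",      "min": 0,  "max": 1,   "is_target": False},
]

# Hypothetical hand-written rules; each maps a source variable to True/False.
RULES = {
    "name_has_age": lambda v: "age" in v["name"].lower(),
    "plausible_range": lambda v: v["min"] >= 18 and v["max"] <= 110,
    "not_binary": lambda v: v["max"] > 1,
}


def candidates(rules):
    """Yield (label, predicate) for single rules and pairwise AND/OR combinations."""
    for name, rule in rules.items():
        yield name, rule
    for a, b in combinations(rules, 2):
        ra, rb = rules[a], rules[b]
        # Default arguments bind the current rules to each lambda.
        yield f"{a} AND {b}", lambda v, ra=ra, rb=rb: ra(v) and rb(v)
        yield f"{a} OR {b}", lambda v, ra=ra, rb=rb: ra(v) or rb(v)


def confusion(predicate, variables):
    """Return (tp, fp, tn, fn) of a predicate against the true labels."""
    tp = fp = tn = fn = 0
    for v in variables:
        predicted, actual = bool(predicate(v)), v["is_target"]
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def misclassified(predicate, variables):
    """Number of false positives plus false negatives."""
    _, fp, _, fn = confusion(predicate, variables)
    return fp + fn


# Pick the combination with the fewest misclassifications on the toy data.
best_label, best_rule = min(
    candidates(RULES), key=lambda c: misclassified(c[1], VARIABLES)
)
tp, fp, tn, fn = confusion(best_rule, VARIABLES)
ppv = tp / (tp + fp) if tp + fp else None  # share of flagged variables that match
npv = tn / (tn + fn) if tn + fn else None  # share of rejected variables that do not match

print(best_label, ppv, npv)
```

In a setting like the one described, the selection step would be run on a construction subset and the chosen combination then scored on a separate validation subset; the sketch uses a single toy list for brevity.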
format Online
Article
Text
id pubmed-5390441
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5390441 2017-04-14 Automatic identification of variables in epidemiological datasets using logic regression Lorenz, Matthias W. Abdi, Negin Ashtiani Scheckenbach, Frank Pflug, Anja Bülbül, Alpaslan Catapano, Alberico L. Agewall, Stefan Ezhov, Marat Bots, Michiel L. Kiechl, Stefan Orth, Andreas BMC Med Inform Decis Mak Research Article BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it makes it possible to build software that presents, for each target variable, a choice of source variables from which a user can pick the matching one, with only a low risk of having missed a correct source variable. METHODS: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. RESULTS: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables.
CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-017-0429-1) contains supplementary material, which is available to authorized users. BioMed Central 2017-04-13 /pmc/articles/PMC5390441/ /pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Lorenz, Matthias W.
Abdi, Negin Ashtiani
Scheckenbach, Frank
Pflug, Anja
Bülbül, Alpaslan
Catapano, Alberico L.
Agewall, Stefan
Ezhov, Marat
Bots, Michiel L.
Kiechl, Stefan
Orth, Andreas
Automatic identification of variables in epidemiological datasets using logic regression
title Automatic identification of variables in epidemiological datasets using logic regression
title_full Automatic identification of variables in epidemiological datasets using logic regression
title_fullStr Automatic identification of variables in epidemiological datasets using logic regression
title_full_unstemmed Automatic identification of variables in epidemiological datasets using logic regression
title_short Automatic identification of variables in epidemiological datasets using logic regression
title_sort automatic identification of variables in epidemiological datasets using logic regression
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390441/
https://www.ncbi.nlm.nih.gov/pubmed/28407816
http://dx.doi.org/10.1186/s12911-017-0429-1