Automatic identification of variables in epidemiological datasets using logic regression
BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated ide...
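Main Authors: | Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas |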
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390441/ https://www.ncbi.nlm.nih.gov/pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1 |
_version_ | 1782521461447065600 |
---|---|
author | Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas |
author_facet | Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas |
author_sort | Lorenz, Matthias W. |
collection | PubMed |
description | BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software which, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. METHODS: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. RESULTS: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables. CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-017-0429-1) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5390441 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5390441 2017-04-14 Automatic identification of variables in epidemiological datasets using logic regression Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas BMC Med Inform Decis Mak Research Article BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software which, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. METHODS: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. RESULTS: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables. CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-017-0429-1) contains supplementary material, which is available to authorized users. BioMed Central 2017-04-13 /pmc/articles/PMC5390441/ /pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas Automatic identification of variables in epidemiological datasets using logic regression |
title | Automatic identification of variables in epidemiological datasets using logic regression |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390441/ https://www.ncbi.nlm.nih.gov/pubmed/28407816 http://dx.doi.org/10.1186/s12911-017-0429-1 |
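The METHODS and RESULTS summarized in the description field can be illustrated with a minimal Python sketch. It shows how one Boolean combination of simple name-based rules might be scored with PPV and NPV against hand-labelled variable names. Everything here is an assumption made for illustration: the rule functions, the target variable (systolic blood pressure), the example variable names and their labels are hypothetical and are not taken from the article. The sketch only evaluates a single fixed combination; it does not perform the logic regression search for an optimal combination that the abstract describes.

```python
from typing import List, Tuple

# Hypothetical hand-written rules for a target variable
# "systolic blood pressure". Each rule inspects a source variable name.
def has_sys(name: str) -> bool:
    return "sys" in name.lower()

def has_bp(name: str) -> bool:
    lowered = name.lower()
    return "bp" in lowered or "press" in lowered

def has_dia(name: str) -> bool:
    return "dia" in name.lower()

def combined_rule(name: str) -> bool:
    # One possible Boolean combination of the rules above:
    # has_sys AND has_bp AND NOT has_dia
    return has_sys(name) and has_bp(name) and not has_dia(name)

def predictive_values(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Return (PPV, NPV) for paired predictions and true labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# Hypothetical labelled source variables: (name, is it the target variable?)
labelled = [
    ("SYSBP", True),
    ("sys_press_1", True),
    ("bp_systolic", True),
    ("sbp", True),            # missed by the rules: a false negative
    ("system_bp_id", False),  # matched by the rules: a false positive
    ("DIABP", False),
    ("sysdia_bp_mean", False),
    ("age", False),
]

predicted = [combined_rule(name) for name, _ in labelled]
actual = [label for _, label in labelled]
ppv, npv = predictive_values(predicted, actual)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

In a semi-automated workflow of the kind the abstract describes, a low PPV means the user is shown many non-matching candidates, while a high NPV means few correct source variables are hidden from the candidate list.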