Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes
We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and pr...
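As a rough illustration of what a posterior matching probability computed from binarized diagnosis codes can look like, here is a minimal sketch in the spirit of the Fellegi-Sunter baseline named in the abstract. It is not the authors' Bayesian model and not the API of the ludic R package: the function name `match_posterior`, the per-code agreement probabilities `m` and `u`, and the prior match probability are all hypothetical, chosen only for this example.

```python
import math


def match_posterior(rec_a, rec_b, m, u, prior=0.01):
    """Posterior probability that two binary code vectors describe the same patient.

    rec_a, rec_b: sequences of 0/1 flags, one per diagnosis code.
    m[k]: assumed probability that code k agrees when the records truly match.
    u[k]: assumed probability that code k agrees when they do not match.
    prior: assumed prior probability that a random pair is a match.
    """
    if not (len(rec_a) == len(rec_b) == len(m) == len(u)):
        raise ValueError("all inputs must have the same length")

    # Start from the prior log-odds of a match.
    log_odds = math.log(prior / (1.0 - prior))

    # Add one log-likelihood ratio per code (codes treated as independent).
    for a, b, m_k, u_k in zip(rec_a, rec_b, m, u):
        if a == b:
            log_odds += math.log(m_k / u_k)
        else:
            log_odds += math.log((1.0 - m_k) / (1.0 - u_k))

    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Illustrative values only; they are not estimated from any dataset.
    m = [0.95, 0.95, 0.95, 0.95]
    u = [0.60, 0.60, 0.60, 0.60]
    patient_a = [1, 0, 1, 0]
    same_codes = [1, 0, 1, 0]
    other_codes = [0, 1, 1, 0]
    print(match_posterior(patient_a, same_codes, m, u))
    print(match_posterior(patient_a, other_codes, m, u))
```

Under these assumptions, each agreeing code raises the log-odds of a match and each disagreeing code lowers it, so the pair with identical codes receives the higher posterior. Unlike the method described in the abstract, this sketch scores each pair in isolation rather than considering all the data at once.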
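Main authors: | Hejblum, Boris P.; Weber, Griffin M.; Liao, Katherine P.; Palmer, Nathan P.; Churchill, Susanne; Shadick, Nancy A.; Szolovits, Peter; Murphy, Shawn N.; Kohane, Isaac S.; Cai, Tianxi |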
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group, 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6326114/ https://www.ncbi.nlm.nih.gov/pubmed/30620344 http://dx.doi.org/10.1038/sdata.2018.298 |
_version_ | 1783386245051711488 |
---|---|
author | Hejblum, Boris P.; Weber, Griffin M.; Liao, Katherine P.; Palmer, Nathan P.; Churchill, Susanne; Shadick, Nancy A.; Szolovits, Peter; Murphy, Shawn N.; Kohane, Isaac S.; Cai, Tianxi |
author_facet | Hejblum, Boris P.; Weber, Griffin M.; Liao, Katherine P.; Palmer, Nathan P.; Churchill, Susanne; Shadick, Nancy A.; Szolovits, Peter; Murphy, Shawn N.; Kohane, Isaac S.; Cai, Tianxi |
author_sort | Hejblum, Boris P. |
collection | PubMed |
description | We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources. |
format | Online Article Text |
id | pubmed-6326114 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-6326114 2019-01-10 Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes Hejblum, Boris P. Weber, Griffin M. Liao, Katherine P. Palmer, Nathan P. Churchill, Susanne Shadick, Nancy A. Szolovits, Peter Murphy, Shawn N. Kohane, Isaac S. Cai, Tianxi Sci Data Article We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources. 
Nature Publishing Group 2019-01-08 /pmc/articles/PMC6326114/ /pubmed/30620344 http://dx.doi.org/10.1038/sdata.2018.298 Text en Copyright © 2019, The Author(s) http://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
spellingShingle | Article Hejblum, Boris P. Weber, Griffin M. Liao, Katherine P. Palmer, Nathan P. Churchill, Susanne Shadick, Nancy A. Szolovits, Peter Murphy, Shawn N. Kohane, Isaac S. Cai, Tianxi Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes |
title | Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes |
title_sort | probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6326114/ https://www.ncbi.nlm.nih.gov/pubmed/30620344 http://dx.doi.org/10.1038/sdata.2018.298 |