
De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation

BACKGROUND: Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifi...


Bibliographic Details
Main authors: Cardinal, Rudolf N.; Moore, Anna; Burchell, Martin; Lewis, Jonathan R.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163749/
https://www.ncbi.nlm.nih.gov/pubmed/37147600
http://dx.doi.org/10.1186/s12911-023-02176-6
author Cardinal, Rudolf N.
Moore, Anna
Burchell, Martin
Lewis, Jonathan R.
collection PubMed
description BACKGROUND: Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier.
METHODS: We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage.
RESULTS: The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband’s presence in the sample database with an area under the receiver operating curve of 0.997–0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931–0.994), and the misidentification rate was 0.00249 (range 0.00123–0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language.
CONCLUSIONS: Fully de-identified matching with high accuracy is feasible without a person-unique identifier and appropriate software is freely available.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-023-02176-6.
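The abstract states that log odds are converted to a decision via a consideration threshold θ and a leader advantage threshold δ, but it does not spell out the rule. The Python sketch below is one plausible reading, not the authors' implementation: posterior log odds are formed by adding per-identifier log likelihood ratios to a prior (assuming conditional independence between identifiers), and the best candidate is accepted only if it reaches θ and leads the runner-up by at least δ. The function names, the treatment of the runner-up, and the threshold values in the usage lines are all assumptions made for illustration.

```python
from typing import Hashable, Optional, Sequence, Tuple


def posterior_log_odds(prior_log_odds: float,
                       log_likelihood_ratios: Sequence[float]) -> float:
    """Bayes' rule in log-odds form: posterior = prior + sum of log LRs.

    Assumption: comparisons of individual identifiers (DOB, forename, ...)
    are treated as conditionally independent, so their log LRs simply add.
    """
    return prior_log_odds + sum(log_likelihood_ratios)


def decide(candidates: Sequence[Tuple[Hashable, float]],
           theta: float,
           delta: float) -> Optional[Hashable]:
    """Return the ID of the winning candidate, or None if no link is declared.

    candidates: (candidate_id, log_odds) pairs for one proband.
    theta: consideration threshold; the best candidate must reach it.
    delta: leader advantage; the best must beat the runner-up by this much.
    Assumption: the runner-up is the second-highest candidate of any score.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score < theta:
        return None  # nobody is plausible enough to consider
    if len(ranked) > 1 and best_score - ranked[1][1] < delta:
        return None  # leader is not clearly ahead of the runner-up
    return best_id


# Illustrative values only; these are not the software's defaults.
score = posterior_log_odds(-10.0, [6.0, 7.5])
winner = decide([("rec_a", 9.0), ("rec_b", 3.0)], theta=5.0, delta=2.0)
```

Raising θ or δ in a rule of this shape trades linkage failures for fewer misidentifications, which matches the direction of the trade-off the abstract describes when it says defaults penalize misidentification more heavily than linkage failure.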
format Online
Article
Text
id pubmed-10163749
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10163749 2023-05-07 De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation Cardinal, Rudolf N. Moore, Anna Burchell, Martin Lewis, Jonathan R. BMC Med Inform Decis Mak Research BioMed Central 2023-05-05 /pmc/articles/PMC10163749/ /pubmed/37147600 http://dx.doi.org/10.1186/s12911-023-02176-6 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation
topic Research