A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers
BACKGROUND: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO...
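Main authors: | Jiao, Yue; Lesueur, Fabienne; Azencott, Chloé-Agathe; Laurent, Maïté; Mebirouk, Noura; Laborde, Lilian; Beauvallet, Juana; Dondon, Marie-Gabrielle; Eon-Marchais, Séverine; Laugé, Anthony; Noguès, Catherine; Andrieu, Nadine; Stoppa-Lyonnet, Dominique; Caputo, Sandrine M. |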
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320036/ https://www.ncbi.nlm.nih.gov/pubmed/34325649 http://dx.doi.org/10.1186/s12874-021-01299-6 |
_version_ | 1783730568971681792 |
---|---|
author | Jiao, Yue; Lesueur, Fabienne; Azencott, Chloé-Agathe; Laurent, Maïté; Mebirouk, Noura; Laborde, Lilian; Beauvallet, Juana; Dondon, Marie-Gabrielle; Eon-Marchais, Séverine; Laugé, Anthony; Noguès, Catherine; Andrieu, Nadine; Stoppa-Lyonnet, Dominique; Caputo, Sandrine M. |
author_sort | Jiao, Yue |
collection | PubMed |
description | BACKGROUND: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on identifying genetic factors that modify cancer risk in BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. METHODS: To identify as many as possible of the individuals who participate in both studies but are not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) combined the candidate matches identified by the two methods. We trained the ML model on a gold standard built from a first version of the two databases. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. RESULTS: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). CONCLUSIONS: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-021-01299-6. |
format | Online Article Text |
id | pubmed-8320036 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8320036 2021-07-30 BMC Med Res Methodol Research Article BioMed Central 2021-07-29 /pmc/articles/PMC8320036/ /pubmed/34325649 http://dx.doi.org/10.1186/s12874-021-01299-6 Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320036/ https://www.ncbi.nlm.nih.gov/pubmed/34325649 http://dx.doi.org/10.1186/s12874-021-01299-6 |
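
The METHODS and RESULTS in the description above report that "PRL + ML" combines the candidate matches found by each method and is then evaluated by recall against a manually verified gold standard. A minimal sketch of that idea follows, assuming the combination is a plain set union of candidate pairs. The record identifiers, the toy data, and the function names `combine` and `recall` are hypothetical and are not taken from the article or its supplementary material.

```python
from typing import Set, Tuple

# A candidate match: (GEMO record id, GENEPSO record id).
# The id format is hypothetical, chosen only for illustration.
Pair = Tuple[str, str]


def combine(prl: Set[Pair], ml: Set[Pair]) -> Set[Pair]:
    """Pool the candidate matches of both methods (assumed here to be a set union)."""
    return prl | ml


def recall(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Fraction of gold-standard matches that appear among the predicted matches."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(predicted & gold) / len(gold)


# Toy data, not from the study.
gold = {("G1", "P1"), ("G2", "P2"), ("G3", "P3"), ("G4", "P4")}
prl_pairs = {("G1", "P1"), ("G2", "P2"), ("G9", "P9")}  # one false positive
ml_pairs = {("G2", "P2"), ("G3", "P3")}

combined = combine(prl_pairs, ml_pairs)

print("PRL recall:     ", recall(prl_pairs, gold))
print("ML recall:      ", recall(ml_pairs, gold))
print("PRL + ML recall:", recall(combined, gold))
```

Under this assumption the combined recall can never be lower than that of either method alone, because every true match found by one method is retained. The pooled set also inherits the false positives of both methods, which is why a manual review step remains useful.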