A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers

BACKGROUND: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO...

Bibliographic Details
Main authors: Jiao, Yue, Lesueur, Fabienne, Azencott, Chloé-Agathe, Laurent, Maïté, Mebirouk, Noura, Laborde, Lilian, Beauvallet, Juana, Dondon, Marie-Gabrielle, Eon-Marchais, Séverine, Laugé, Anthony, Noguès, Catherine, Andrieu, Nadine, Stoppa-Lyonnet, Dominique, Caputo, Sandrine M.
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320036/
https://www.ncbi.nlm.nih.gov/pubmed/34325649
http://dx.doi.org/10.1186/s12874-021-01299-6
author Jiao, Yue
Lesueur, Fabienne
Azencott, Chloé-Agathe
Laurent, Maïté
Mebirouk, Noura
Laborde, Lilian
Beauvallet, Juana
Dondon, Marie-Gabrielle
Eon-Marchais, Séverine
Laugé, Anthony
Noguès, Catherine
Andrieu, Nadine
Stoppa-Lyonnet, Dominique
Caputo, Sandrine M.
collection PubMed
description BACKGROUND: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying the cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. METHODS: To identify as many as possible of the individuals who participate in both studies but are not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) combined the candidate matches identified by both approaches. We built the ML model using the gold standard from a first version of the two databases as the training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. RESULTS: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined PRL + ML linkage showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). CONCLUSIONS: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-021-01299-6.
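The combination step described in the abstract (pooling the candidate matches from PRL and ML, then scoring recall against a manually reviewed gold standard) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record identifiers, the toy match sets, and the function names `recall` and `combine_matches` are hypothetical and do not come from the authors' pipeline or data.

```python
from typing import Set, Tuple

# A candidate match pairs one record from each database.
# The identifiers used here are made up for illustration only.
Pair = Tuple[str, str]  # (GEMO record id, GENEPSO record id)


def recall(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """Fraction of gold-standard true matches that were recovered."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(predicted & gold) / len(gold)


def combine_matches(prl: Set[Pair], ml: Set[Pair]) -> Set[Pair]:
    """Pool candidate matches from both approaches (set union)."""
    return prl | ml


# Toy gold standard and toy outputs of the two linkage approaches.
gold = {("G1", "P1"), ("G2", "P2"), ("G3", "P3"), ("G4", "P4")}
prl_matches = {("G1", "P1"), ("G2", "P2"), ("G9", "P9")}  # one false match
ml_matches = {("G2", "P2"), ("G3", "P3")}

combined = combine_matches(prl_matches, ml_matches)

print(recall(prl_matches, gold))  # 0.5
print(recall(ml_matches, gold))   # 0.5
print(recall(combined, gold))     # 0.75
```

Under this union assumption, the combined recall can never fall below the recall of either approach alone, which is consistent with the ordering of the recall values reported in the abstract. The union also pools the false matches of both approaches, so precision may drop; the sketch does not model how the pooled candidates are subsequently reviewed.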
format Online
Article
Text
id pubmed-8320036
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8320036 2021-07-30
BMC Med Res Methodol
Research Article
BioMed Central 2021-07-29
/pmc/articles/PMC8320036/ /pubmed/34325649 http://dx.doi.org/10.1186/s12874-021-01299-6
Text en
© The Author(s) 2021
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320036/
https://www.ncbi.nlm.nih.gov/pubmed/34325649
http://dx.doi.org/10.1186/s12874-021-01299-6