
Effect of data harmonization of multicentric dataset in ASD/TD classification

Machine Learning (ML) is nowadays an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular in the identification of brain correlates in neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimaging are typ...


Bibliographic Details
Main authors: Serra, Giacomo, Mainas, Francesca, Golosio, Bruno, Retico, Alessandra, Oliva, Piernicola
Format: Online Article Text
Language: English
Published: Springer Berlin Heidelberg 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10676338/
https://www.ncbi.nlm.nih.gov/pubmed/38006422
http://dx.doi.org/10.1186/s40708-023-00210-x
collection PubMed
description Machine Learning (ML) is now an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular for identifying brain correlates of neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimaging are typically obtained by collecting data from multiple acquisition centers. However, analyzing large multicentric datasets can introduce bias due to differences between acquisition centers. ComBat harmonization is commonly used to address batch effects, but it can lead to data leakage when the entire dataset is used to estimate model parameters. In this study, structural and functional MRI data from the Autism Brain Imaging Data Exchange (ABIDE) collection were used to classify subjects with Autism Spectrum Disorders (ASD) versus Typically Developing controls (TD). We compared the classical approach (external harmonization), in which harmonization is performed before the train/test split, with a harmonization calculated only on the train set (internal harmonization) and with the non-harmonized dataset. The results showed that harmonization using the whole dataset achieved higher discrimination performance, while non-harmonized data and harmonization using only the train set showed similar results, for both structural and connectivity features. We also showed that the higher performance of the external harmonization is not due to the larger sample available for estimating the model; hence, the improved performance obtained with the entire dataset may be ascribed to data leakage. To prevent this leakage, it is recommended to define the harmonization model using the train set only. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s40708-023-00210-x.
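The difference between the external and internal schemes described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and it is not ComBat: a simple per-site mean and variance adjustment stands in for the harmonization model, and every name in it (SiteStandardizer, internal_harmonization, external_harmonization) is hypothetical. The only point it makes is where the harmonization parameters are estimated.

```python
import numpy as np


class SiteStandardizer:
    """Simplified per-site location/scale adjustment.

    Assumption: this is a stand-in for a harmonization model, NOT ComBat
    (no empirical Bayes step, no covariate preservation).
    """

    def fit(self, X, sites):
        X = np.asarray(X, dtype=float)
        sites = np.asarray(sites)
        # Pooled reference statistics, estimated only on the data given to fit().
        self.grand_mean_ = X.mean(axis=0)
        self.grand_std_ = X.std(axis=0) + 1e-12
        # Per-site statistics, also estimated only on the data given to fit().
        self.site_stats_ = {}
        for s in np.unique(sites):
            Xs = X[sites == s]
            self.site_stats_[s] = (Xs.mean(axis=0), Xs.std(axis=0) + 1e-12)
        return self

    def transform(self, X, sites):
        X = np.asarray(X, dtype=float)
        sites = np.asarray(sites)
        out = np.empty_like(X)
        for s in np.unique(sites):
            # Raises KeyError if a site was not seen during fit().
            mean_s, std_s = self.site_stats_[s]
            mask = sites == s
            out[mask] = (X[mask] - mean_s) / std_s * self.grand_std_ + self.grand_mean_
        return out


def internal_harmonization(X_train, sites_train, X_test, sites_test):
    """Estimate parameters on the train set only, then apply them to both sets."""
    model = SiteStandardizer().fit(X_train, sites_train)
    return (
        model.transform(X_train, sites_train),
        model.transform(X_test, sites_test),
        model,
    )


def external_harmonization(X, sites):
    """Estimate parameters on the whole dataset before any split (leak-prone)."""
    return SiteStandardizer().fit(X, sites).transform(X, sites)
```

In the internal variant the test subjects never contribute to the estimated site parameters, which is the property the abstract recommends; in the external variant they do, because the split happens after harmonization.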
format Online
Article
Text
id pubmed-10676338
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer Berlin Heidelberg
record_format MEDLINE/PubMed
spelling pubmed-10676338 2023-11-25 Effect of data harmonization of multicentric dataset in ASD/TD classification Serra, Giacomo Mainas, Francesca Golosio, Bruno Retico, Alessandra Oliva, Piernicola Brain Inform Research Springer Berlin Heidelberg 2023-11-25 /pmc/articles/PMC10676338/ /pubmed/38006422 http://dx.doi.org/10.1186/s40708-023-00210-x Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
topic Research