Cargando…

Towards population-independent, multi-disease detection in fundus photographs

Independent validation studies of automatic diabetic retinopathy screening systems have recently shown a drop of screening performance on external data. Beyond diabetic retinopathy, this study investigates the generalizability of deep learning (DL) algorithms for screening various ocular anomalies i...

Descripción completa

Detalles Bibliográficos
Autores principales: Matta, Sarah, Lamard, Mathieu, Conze, Pierre-Henri, Le Guilcher, Alexandre, Lecat, Clément, Carette, Romuald, Basset, Fabien, Massin, Pascale, Rottier, Jean-Bernard, Cochener, Béatrice, Quellec, Gwenolé
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Nature Publishing Group UK 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10352300/
https://www.ncbi.nlm.nih.gov/pubmed/37460629
http://dx.doi.org/10.1038/s41598-023-38610-y
Descripción
Sumario:Independent validation studies of automatic diabetic retinopathy screening systems have recently shown a drop of screening performance on external data. Beyond diabetic retinopathy, this study investigates the generalizability of deep learning (DL) algorithms for screening various ocular anomalies in fundus photographs, across heterogeneous populations and imaging protocols. The following datasets are considered: OPHDIAT (France, diabetic population), OphtaMaine (France, general population), RIADD (India, general population) and ODIR (China, general population). Two multi-disease DL algorithms were developed: a Single-Dataset (SD) network, trained on the largest dataset (OPHDIAT), and a Multiple-Dataset (MD) network, trained on multiple datasets simultaneously. To assess their generalizability, both algorithms were evaluated whenever training and test data originate from overlapping datasets or from disjoint datasets. The SD network achieved a mean per-disease area under the receiver operating characteristic curve (mAUC) of 0.9571 on OPHDIAT. However, it generalized poorly to the other three datasets (mAUC < 0.9). When all four datasets were involved in training, the MD network significantly outperformed the SD network (p = 0.0058), indicating improved generality. However, in leave-one-dataset-out experiments, performance of the MD network was significantly lower on populations unseen during training than on populations involved in training (p < 0.0001), indicating imperfect generalizability.