
Correcting data imbalance for semi-supervised COVID-19 detection using X-ray chest images

A key factor in the fight against viral diseases such as the coronavirus (COVID-19) is the identification of virus carriers as early and quickly as possible, in a cheap and efficient manner. The application of deep learning for image classification of chest X-ray images of COVID-19 patients could be...


Bibliographic Details
Main authors: Calderon-Ramirez, Saul, Yang, Shengxiang, Moemeni, Armaghan, Elizondo, David, Colreavy-Donnelly, Simon, Chavarría-Estrada, Luis Fernando, Molina-Cabello, Miguel A.
Format: Online Article Text
Language: English
Published: Elsevier B.V. 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8276579/
https://www.ncbi.nlm.nih.gov/pubmed/34276263
http://dx.doi.org/10.1016/j.asoc.2021.107692
_version_ 1783721930860265472
author Calderon-Ramirez, Saul
Yang, Shengxiang
Moemeni, Armaghan
Elizondo, David
Colreavy-Donnelly, Simon
Chavarría-Estrada, Luis Fernando
Molina-Cabello, Miguel A.
author_sort Calderon-Ramirez, Saul
collection PubMed
description A key factor in the fight against viral diseases such as the coronavirus (COVID-19) is the identification of virus carriers as early and quickly as possible, in a cheap and efficient manner. The application of deep learning for image classification of chest X-ray images of COVID-19 patients could become a useful pre-diagnostic detection methodology. However, deep learning architectures require large labelled datasets. This is often a limitation when the subject of research is relatively new, as in the case of the virus outbreak, where dealing with small labelled datasets is a challenge. Moreover, in such a context, the datasets are also highly imbalanced, with few observations from positive cases of the new disease. In this work we evaluate the performance of the semi-supervised deep learning architecture known as MixMatch with a very limited number of labelled observations and highly imbalanced labelled datasets. We demonstrate the critical impact of data imbalance on the model’s accuracy. Therefore, we propose a simple approach for correcting data imbalance by re-weighting each observation in the loss function, giving a higher weight to the observations corresponding to the under-represented class. For unlabelled observations, we use the pseudo and augmented labels calculated by MixMatch to choose the appropriate weight. The proposed method improved classification accuracy by up to 18% with respect to the non-balanced MixMatch algorithm. We tested our proposed approach with several available datasets using 10, 15 and 20 labelled observations for binary classification (COVID-19 positive and normal cases). For multi-class classification (COVID-19 positive, pneumonia and normal cases), we tested 30, 50, 70 and 90 labelled observations. Additionally, a new dataset is included among the tested datasets, composed of chest X-ray images of Costa Rican adult patients.
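The re-weighting idea in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the inverse-frequency weighting scheme, and the use of cross-entropy for both labelled and pseudo-labelled observations are assumptions made for illustration only (the record does not state the exact weights or loss terms used in the paper).

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Hypothetical helper: one weight per class, inversely proportional
    to how often the class appears in the labelled set, normalised to sum to 1."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    # A class with no labelled observations gets weight 0 in this sketch.
    inverse = [1.0 / c if c > 0 else 0.0 for c in counts]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_cross_entropy(probs, targets, class_weights):
    """Mean cross-entropy in which each observation is scaled by the weight
    of its class. `targets` may be one-hot labels (labelled data) or soft
    pseudo-labels (unlabelled data); the class is taken as the arg-max."""
    total = 0.0
    for p, t in zip(probs, targets):
        cls = max(range(len(t)), key=lambda k: t[k])
        ce = -sum(tk * math.log(max(pk, 1e-12)) for pk, tk in zip(p, t))
        total += class_weights[cls] * ce
    return total / len(probs)


# Example: 8 normal cases (class 0) and 2 COVID-19 positive cases (class 1).
weights = inverse_frequency_weights([0] * 8 + [1] * 2, num_classes=2)
loss = weighted_cross_entropy(
    probs=[[0.9, 0.1], [0.3, 0.7]],
    targets=[[1.0, 0.0], [0.0, 1.0]],
    class_weights=weights,
)
```

With these weights, an error on the under-represented positive class contributes more to the loss than the same error on the majority class, which is the effect the description attributes to the proposed correction.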
format Online
Article
Text
id pubmed-8276579
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier B.V.
record_format MEDLINE/PubMed
spelling pubmed-8276579 2021-07-14 Correcting data imbalance for semi-supervised COVID-19 detection using X-ray chest images Calderon-Ramirez, Saul Yang, Shengxiang Moemeni, Armaghan Elizondo, David Colreavy-Donnelly, Simon Chavarría-Estrada, Luis Fernando Molina-Cabello, Miguel A. Appl Soft Comput Article Elsevier B.V. 2021-11 2021-07-13 /pmc/articles/PMC8276579/ /pubmed/34276263 http://dx.doi.org/10.1016/j.asoc.2021.107692 Text en © 2021 Elsevier B.V. All rights reserved. Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre - including this research content - immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title Correcting data imbalance for semi-supervised COVID-19 detection using X-ray chest images
title_sort correcting data imbalance for semi-supervised covid-19 detection using x-ray chest images
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8276579/
https://www.ncbi.nlm.nih.gov/pubmed/34276263
http://dx.doi.org/10.1016/j.asoc.2021.107692