Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings
Main Authors: | Low, Daniel M.; Rao, Vishwanatha; Randolph, Gregory; Song, Phillip C.; Ghosh, Satrajit S. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory, 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7836138/ https://www.ncbi.nlm.nih.gov/pubmed/33501466 http://dx.doi.org/10.1101/2020.11.23.20235945 |
_version_ | 1783642682933903360 |
---|---|
author | Low, Daniel M.; Rao, Vishwanatha; Randolph, Gregory; Song, Phillip C.; Ghosh, Satrajit S. |
author_facet | Low, Daniel M.; Rao, Vishwanatha; Randolph, Gregory; Song, Phillip C.; Ghosh, Satrajit S. |
author_sort | Low, Daniel M. |
collection | PubMed |
description | INTRODUCTION. Detecting voice disorders from voice recordings could allow for frequent, remote, and low-cost screening before costly clinical visits and a more invasive laryngoscopy examination. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction to increase trust, and to determine model performance relative to clinician performance. METHODS. Patients with UVFP confirmed through endoscopic examination (N=77) and controls with normal voices matched for age and sex (N=77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel “a”. Four machine learning models of differing complexity were used. SHapley Additive exPlanations (SHAP) was used to identify important features. RESULTS. The highest median bootstrapped ROC AUC score was 0.87, exceeding clinicians’ performance on the same recordings (range: 0.74–0.81). Recording durations differed between UVFP recordings and controls because of how the data were originally processed for storage, and we show that duration alone can classify the two groups. Counterintuitively, many UVFP recordings also had higher intensity than controls, even though UVFP patients tend to have weaker voices, revealing a dataset-specific bias that we mitigate in an additional analysis. CONCLUSION. We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which the models exploited to improve classification. Furthermore, clinicians’ ratings provide evidence that patients were over-projecting their voices and were recorded at a higher signal amplitude than controls. Notably, after matching audio durations and removing intensity-associated variables to mitigate these biases, the models still achieved similarly high performance. We provide a set of recommendations for avoiding bias when building and evaluating machine learning models for screening in laryngology. (Illustrative sketches of the bootstrapped AUC/SHAP step and the duration-bias check follow the record fields below.) |
format | Online Article Text |
id | pubmed-7836138 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-7836138 2021-01-27 Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings Low, Daniel M.; Rao, Vishwanatha; Randolph, Gregory; Song, Phillip C.; Ghosh, Satrajit S. medRxiv Article (abstract as in the description field above) Cold Spring Harbor Laboratory 2023-10-23 /pmc/articles/PMC7836138/ /pubmed/33501466 http://dx.doi.org/10.1101/2020.11.23.20235945 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article; Low, Daniel M.; Rao, Vishwanatha; Randolph, Gregory; Song, Phillip C.; Ghosh, Satrajit S.; Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
title | Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
title_full | Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
title_fullStr | Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
title_full_unstemmed | Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
title_short | Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
title_sort | identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7836138/ https://www.ncbi.nlm.nih.gov/pubmed/33501466 http://dx.doi.org/10.1101/2020.11.23.20235945 |
work_keys_str_mv | AT lowdanielm identifyingbiasinmodelsthatdetectvocalfoldparalysisfromaudiorecordingsusingexplainablemachinelearningandclinicianratings AT raovishwanatha identifyingbiasinmodelsthatdetectvocalfoldparalysisfromaudiorecordingsusingexplainablemachinelearningandclinicianratings AT randolphgregory identifyingbiasinmodelsthatdetectvocalfoldparalysisfromaudiorecordingsusingexplainablemachinelearningandclinicianratings AT songphillipc identifyingbiasinmodelsthatdetectvocalfoldparalysisfromaudiorecordingsusingexplainablemachinelearningandclinicianratings AT ghoshsatrajits identifyingbiasinmodelsthatdetectvocalfoldparalysisfromaudiorecordingsusingexplainablemachinelearningandclinicianratings |
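The description field above reports a median bootstrapped ROC AUC of 0.87 and uses SHAP to identify the features driving predictions. The following is a minimal, hypothetical Python sketch of those two steps, not the authors' code: the gradient-boosted classifier stands in for the paper's four models of differing complexity, and the feature matrix is synthetic placeholder data rather than real acoustic features.

```python
# Illustrative sketch (not the authors' code): bootstrapped ROC AUC and
# SHAP feature importances for a binary UVFP-vs-control classifier.
# The feature matrix and labels below are synthetic placeholders.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical acoustic feature matrix: 154 speakers (77 patients,
# 77 controls), 20 features each.
X = rng.normal(size=(154, 20))
y = np.repeat([0, 1], 77)  # 0 = control, 1 = UVFP

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# Bootstrap the test set to get a distribution of ROC AUC scores, then
# report the median, as in the abstract's "median bootstrapped ROC AUC".
aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))
    if len(np.unique(y_te[idx])) < 2:  # resample must contain both classes
        continue
    aucs.append(roc_auc_score(y_te[idx], scores[idx]))
print(f"median bootstrapped ROC AUC: {np.median(aucs):.2f}")

# SHAP values explain each prediction; averaging their magnitudes gives a
# global importance ranking over features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
mean_abs_shap = np.abs(shap_values).mean(axis=0)
print("most important feature index:", int(np.argmax(mean_abs_shap)))
```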
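The abstract also notes that audio duration alone could separate patients from controls, a recording artifact rather than a clinical signal. Below is a hedged sketch of that kind of bias check, using hypothetical durations that mimic the reported mismatch; the specific means and spreads are invented for illustration.

```python
# Illustrative sketch (hypothetical values, not the authors' data): test
# whether a nuisance variable (audio duration) alone predicts the labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(1)

# Hypothetical per-recording durations in seconds: patient clips stored
# systematically longer than control clips, mimicking the reported bias.
dur_controls = rng.normal(4.0, 0.5, 77)
dur_patients = rng.normal(6.0, 0.5, 77)
duration = np.concatenate([dur_controls, dur_patients]).reshape(-1, 1)
y = np.repeat([0, 1], 77)  # 0 = control, 1 = UVFP

# If a single nuisance variable yields a high cross-validated AUC, the
# labels are partly predictable from how the data were recorded and
# stored, not from voice pathology itself.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
probs = cross_val_predict(LogisticRegression(), duration, y, cv=cv,
                          method="predict_proba")[:, 1]
print(f"duration-only ROC AUC: {roc_auc_score(y, probs):.2f}")

# Mitigation in the spirit of the abstract: trim all clips to a common
# duration and drop intensity-associated features before retraining.
```

In the paper's workflow, a high AUC from such a check motivated the mitigation analysis: matching audio durations and removing intensity-associated variables before re-evaluating the models.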