
Validation pipeline for machine learning algorithm assessment for multiple vendors

A standardized objective evaluation method is needed to compare machine learning (ML) algorithms as these tools become available for clinical use. Therefore, we designed, built, and tested an evaluation pipeline with the goal of normalizing performance measurement of independently developed algorithms, using a common test dataset of our clinical imaging. Three vendor applications for detecting solid, part-solid, and groundglass lung nodules in chest CT examinations were assessed in this retrospective study using our data-preprocessing and algorithm assessment chain. The pipeline included tools for image cohort creation and de-identification; report and image annotation for ground-truth labeling; server partitioning to receive vendor “black box” algorithms and to enable model testing on our internal clinical data (100 chest CTs with 243 nodules) from within our security firewall; model validation and result visualization; and performance assessment calculating algorithm recall, precision, and receiver operating characteristic curves (ROC). Algorithm true positives, false positives, false negatives, recall, and precision for detecting lung nodules were as follows: Vendor-1 (194, 23, 49, 0.80, 0.89); Vendor-2 (182, 270, 61, 0.75, 0.40); Vendor-3 (75, 120, 168, 0.32, 0.39). The AUCs for detection of solid (0.61–0.74), groundglass (0.66–0.86) and part-solid (0.52–0.86) nodules varied between the three vendors. Our ML model validation pipeline enabled testing of multi-vendor algorithms within the institutional firewall. Wide variations in algorithm performance for detection as well as classification of lung nodules justify the premise for a standardized objective ML algorithm evaluation process.

Bibliographic Details
Main Authors: Bizzo, Bernardo C., Ebrahimian, Shadi, Walters, Mark E., Michalski, Mark H., Andriole, Katherine P., Dreyer, Keith J., Kalra, Mannudeep K., Alkasab, Tarik, Digumarthy, Subba R.
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9053776/
https://www.ncbi.nlm.nih.gov/pubmed/35486572
http://dx.doi.org/10.1371/journal.pone.0267213
Journal: PLoS One (Research Article)
Published online: 2022-04-29
© 2022 Bizzo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.