Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports

BACKGROUND: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications. METHODS: We tested the F1 score, precision, and recall to compare...

Bibliographic Details
Main authors: Casey, Arlene, Davidson, Emma, Grover, Claire, Tobin, Richard, Grivas, Andreas, Zhang, Huayu, Schrempf, Patrick, O’Neil, Alison Q., Lee, Liam, Walsh, Michael, Pellie, Freya, Ferguson, Karen, Cvoro, Vera, Wu, Honghan, Whalley, Heather, Mair, Grant, Whiteley, William, Alex, Beatrice
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10569314/
https://www.ncbi.nlm.nih.gov/pubmed/37840686
http://dx.doi.org/10.3389/fdgth.2023.1184919
author Casey, Arlene
Davidson, Emma
Grover, Claire
Tobin, Richard
Grivas, Andreas
Zhang, Huayu
Schrempf, Patrick
O’Neil, Alison Q.
Lee, Liam
Walsh, Michael
Pellie, Freya
Ferguson, Karen
Cvoro, Vera
Wu, Honghan
Whalley, Heather
Mair, Grant
Whiteley, William
Alex, Beatrice
author_sort Casey, Arlene
collection PubMed
description BACKGROUND: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications. METHODS: We used the F1 score, precision, and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques established during the development of EdIE-R, in conjunction with an expert researcher who read underlying images. RESULTS: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR was ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For SVD, EdIE-R achieved F1 scores of ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest with ≥77%, and Sem-EHR scored ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%. CONCLUSIONS: The four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, although the variation is more apparent for ischaemic stroke. If NLP tools are to be used in clinical settings, they cannot be used “out of the box.” It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.
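As a reminder of how the three metrics reported in the abstract relate to one another, the sketch below computes precision, recall, and F1 from true-positive, false-positive, and false-negative counts. The `Counts` helper and the example counts are hypothetical illustrations, not data or code from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical per-phenotype counts of a tool's labels against a reference standard."""
    tp: int  # reports where tool and reference both mark the phenotype present
    fp: int  # tool marks present, reference marks absent
    fn: int  # tool marks absent, reference marks present


def precision(c: Counts) -> float:
    """Share of the tool's positive labels that are correct."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    """Share of the reference positives that the tool found."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts: a tool that is precise but misses a quarter of true cases.
example = Counts(tp=90, fp=10, fn=30)
print(f"precision={precision(example):.2f} "
      f"recall={recall(example):.2f} f1={f1(example):.2f}")
```

Because F1 is a harmonic mean, a tool can have the highest precision in a comparison and still a lower F1 if its recall is weaker, which is the pattern the abstract describes for ESPRESSO on ischaemic stroke.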
format Online
Article
Text
id pubmed-10569314
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10569314 2023-10-13 Front Digit Health Digital Health
Frontiers Media S.A. 2023-09-28 /pmc/articles/PMC10569314/ /pubmed/37840686 http://dx.doi.org/10.3389/fdgth.2023.1184919 Text en
© 2023 Casey, Davidson, Grover, Tobin, Grivas, Zhang, Schrempf, O’Neil, Lee, Walsh, Pellie, Ferguson, Cvoro, Wu, Whalley, Mair, Whiteley and Alex.
https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) (https://creativecommons.org/licenses/by/4.0/). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports
topic Digital Health