
Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review

BACKGROUND: Dialog agents (chatbots) have a long history of application in health care, where they have been used for tasks such as supporting patient self-management and providing counseling. Their use is expected to grow with increasing demands on health systems and improving artificial intelligence (AI) capability. Approaches to the evaluation of health care chatbots, however, appear to be diverse and haphazard, resulting in a potential barrier to the advancement of the field. OBJECTIVE: This study aims to identify the technical (nonclinical) metrics used by previous studies to evaluate health care chatbots. METHODS: Studies were identified by searching 7 bibliographic databases (eg, MEDLINE and PsycINFO) in addition to conducting backward and forward reference list checking of the included studies and relevant reviews. The studies were independently selected by two reviewers who then extracted data from the included studies. Extracted data were synthesized narratively by grouping the identified metrics into categories based on the aspect of chatbots that the metrics evaluated. RESULTS: Of the 1498 citations retrieved, 65 studies were included in this review. Chatbots were evaluated using 27 technical metrics, which were related to chatbots as a whole (eg, usability, classifier performance, speed), response generation (eg, comprehensibility, realism, repetitiveness), response understanding (eg, chatbot understanding as assessed by users, word error rate, concept error rate), and esthetics (eg, appearance of the virtual agent, background color, and content). CONCLUSIONS: The technical metrics of health chatbot studies were diverse, with survey designs and global usability metrics dominating. The lack of standardization and paucity of objective measures make it difficult to compare the performance of health chatbots and could inhibit advancement of the field. We suggest that researchers more frequently include metrics computed from conversation logs. In addition, we recommend the development of a framework of technical metrics with recommendations for specific circumstances for their inclusion in chatbot studies.
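
As an illustration of the log-derived metrics the review recommends, a minimal sketch of one of the response-understanding metrics it identifies, word error rate, computed from a pair of transcripts; the function, example utterances, and values below are illustrative assumptions, not material from the review:

# Minimal sketch (illustrative, not from the review): word error rate (WER),
# computed from a conversation log pair of what the user actually said vs.
# what the chatbot recognized.
# WER = (substitutions + deletions + insertions) / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example utterance: two substitutions out of seven reference words.
print(word_error_rate("i need to refill my insulin prescription",
                      "i need to fill my insulin subscription"))  # ~0.286
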


Bibliographic Details
Main Authors: Abd-Alrazaq, Alaa, Safi, Zeineb, Alajlani, Mohannad, Warren, Jim, Househ, Mowafa, Denecke, Kerstin
Format: Online Article Text
Language: English
Published: JMIR Publications 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7305563/
https://www.ncbi.nlm.nih.gov/pubmed/32442157
http://dx.doi.org/10.2196/18301
_version_ 1783548489168322560
author Abd-Alrazaq, Alaa
Safi, Zeineb
Alajlani, Mohannad
Warren, Jim
Househ, Mowafa
Denecke, Kerstin
author_facet Abd-Alrazaq, Alaa
Safi, Zeineb
Alajlani, Mohannad
Warren, Jim
Househ, Mowafa
Denecke, Kerstin
author_sort Abd-Alrazaq, Alaa
collection PubMed
description BACKGROUND: Dialog agents (chatbots) have a long history of application in health care, where they have been used for tasks such as supporting patient self-management and providing counseling. Their use is expected to grow with increasing demands on health systems and improving artificial intelligence (AI) capability. Approaches to the evaluation of health care chatbots, however, appear to be diverse and haphazard, resulting in a potential barrier to the advancement of the field. OBJECTIVE: This study aims to identify the technical (nonclinical) metrics used by previous studies to evaluate health care chatbots. METHODS: Studies were identified by searching 7 bibliographic databases (eg, MEDLINE and PsycINFO) in addition to conducting backward and forward reference list checking of the included studies and relevant reviews. The studies were independently selected by two reviewers who then extracted data from the included studies. Extracted data were synthesized narratively by grouping the identified metrics into categories based on the aspect of chatbots that the metrics evaluated. RESULTS: Of the 1498 citations retrieved, 65 studies were included in this review. Chatbots were evaluated using 27 technical metrics, which were related to chatbots as a whole (eg, usability, classifier performance, speed), response generation (eg, comprehensibility, realism, repetitiveness), response understanding (eg, chatbot understanding as assessed by users, word error rate, concept error rate), and esthetics (eg, appearance of the virtual agent, background color, and content). CONCLUSIONS: The technical metrics of health chatbot studies were diverse, with survey designs and global usability metrics dominating. The lack of standardization and paucity of objective measures make it difficult to compare the performance of health chatbots and could inhibit advancement of the field. We suggest that researchers more frequently include metrics computed from conversation logs. In addition, we recommend the development of a framework of technical metrics with recommendations for specific circumstances for their inclusion in chatbot studies.
format Online
Article
Text
id pubmed-7305563
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-73055632020-06-24 Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review Abd-Alrazaq, Alaa Safi, Zeineb Alajlani, Mohannad Warren, Jim Househ, Mowafa Denecke, Kerstin J Med Internet Res Review BACKGROUND: Dialog agents (chatbots) have a long history of application in health care, where they have been used for tasks such as supporting patient self-management and providing counseling. Their use is expected to grow with increasing demands on health systems and improving artificial intelligence (AI) capability. Approaches to the evaluation of health care chatbots, however, appear to be diverse and haphazard, resulting in a potential barrier to the advancement of the field. OBJECTIVE: This study aims to identify the technical (nonclinical) metrics used by previous studies to evaluate health care chatbots. METHODS: Studies were identified by searching 7 bibliographic databases (eg, MEDLINE and PsycINFO) in addition to conducting backward and forward reference list checking of the included studies and relevant reviews. The studies were independently selected by two reviewers who then extracted data from the included studies. Extracted data were synthesized narratively by grouping the identified metrics into categories based on the aspect of chatbots that the metrics evaluated. RESULTS: Of the 1498 citations retrieved, 65 studies were included in this review. Chatbots were evaluated using 27 technical metrics, which were related to chatbots as a whole (eg, usability, classifier performance, speed), response generation (eg, comprehensibility, realism, repetitiveness), response understanding (eg, chatbot understanding as assessed by users, word error rate, concept error rate), and esthetics (eg, appearance of the virtual agent, background color, and content). CONCLUSIONS: The technical metrics of health chatbot studies were diverse, with survey designs and global usability metrics dominating. The lack of standardization and paucity of objective measures make it difficult to compare the performance of health chatbots and could inhibit advancement of the field. We suggest that researchers more frequently include metrics computed from conversation logs. In addition, we recommend the development of a framework of technical metrics with recommendations for specific circumstances for their inclusion in chatbot studies. JMIR Publications 2020-06-05 /pmc/articles/PMC7305563/ /pubmed/32442157 http://dx.doi.org/10.2196/18301 Text en ©Alaa Abd-Alrazaq, Zeineb Safi, Mohannad Alajlani, Jim Warren, Mowafa Househ, Kerstin Denecke. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 05.06.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Review
Abd-Alrazaq, Alaa
Safi, Zeineb
Alajlani, Mohannad
Warren, Jim
Househ, Mowafa
Denecke, Kerstin
Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review
title Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review
title_full Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review
title_fullStr Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review
title_full_unstemmed Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review
title_short Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review
title_sort technical metrics used to evaluate health care chatbots: scoping review
topic Review
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7305563/
https://www.ncbi.nlm.nih.gov/pubmed/32442157
http://dx.doi.org/10.2196/18301
work_keys_str_mv AT abdalrazaqalaa technicalmetricsusedtoevaluatehealthcarechatbotsscopingreview
AT safizeineb technicalmetricsusedtoevaluatehealthcarechatbotsscopingreview
AT alajlanimohannad technicalmetricsusedtoevaluatehealthcarechatbotsscopingreview
AT warrenjim technicalmetricsusedtoevaluatehealthcarechatbotsscopingreview
AT househmowafa technicalmetricsusedtoevaluatehealthcarechatbotsscopingreview
AT deneckekerstin technicalmetricsusedtoevaluatehealthcarechatbotsscopingreview