
A Chinese telemedicine-dialogue dataset annotated for named entities

BACKGROUND: A large collection of dialogues between patients and doctors must be annotated for medical named entities to build intelligence for telemedicine. However, since most patients involved in telemedicine deliver related named entities in informal and long multiword expressions, it is challen...

Bibliographic Details
Main Authors: Wang, Shanshan; Yan, Yajing; Yan, Rong; Li, Ting; Ma, Kaijie; Yan, Yani
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10655334/
https://www.ncbi.nlm.nih.gov/pubmed/37974215
http://dx.doi.org/10.1186/s12911-023-02365-3
_version_ 1785147923780075520
author Wang, Shanshan
Yan, Yajing
Yan, Rong
Li, Ting
Ma, Kaijie
Yan, Yani
author_sort Wang, Shanshan
collection PubMed
description BACKGROUND: A large collection of dialogues between patients and doctors must be annotated for medical named entities to build intelligence for telemedicine. However, since most patients involved in telemedicine deliver related named entities in informal and long multiword expressions, it is challenging to tag their telemedicine dialogue data. This study aims to address this issue. METHODS: With the telemedicine dialogue dataset for obstetrics and gynecology taken from haodf.com, we developed guidelines and followed a two-round procedure to tag six types of named entities: disease, symptom, time, pharmaceutical, operation, and examination. Additionally, we developed four deep-learning models based on this dataset to establish a benchmark for named-entity recognition (NER). RESULTS: The distilled obstetrics and gynecology dataset contains 2,383 consultations between doctors and patients, of which 13,411 sentences were from doctors and 17,929 were from patients. With 63,560 named entities in total, the average number of characters per named entity is 4.33. The experimental results suggest that LatticeLSTM performs best on our dataset in terms of accuracy, precision, recall, and F score. CONCLUSION: Compared with other datasets, this dataset offers three novel facets. First, this study offers intricately tagged long multiword expressions for medical named entities. Second, this study is one of the first attempts to mark temporal entities in a medical dataset. Third, this annotated dataset is balanced across the six types of labels, which we believe will play a considerable role in expanding telemedicine artificial intelligence.
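The annotation described in the abstract can be illustrated with a minimal sketch, assuming character-level BIO tags, which are a common encoding for Chinese NER. The record does not state which tagging scheme the authors used, so the scheme, the label strings, the helper names `spans_to_bio` and `mean_entity_length`, and the example sentence are all hypothetical. The second helper shows how a statistic such as the reported average of 4.33 characters per entity could be computed from span annotations.

```python
from typing import List, Tuple

# The six entity types named in the abstract. The label strings are illustrative,
# not taken from the dataset's actual annotation files.
ENTITY_TYPES = {"disease", "symptom", "time", "pharmaceutical", "operation", "examination"}

# A span is (start, end_exclusive, entity_type), indexed by character.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character spans to one BIO tag per character (assumed scheme)."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def mean_entity_length(spans: List[Span]) -> float:
    """Average number of characters per annotated entity."""
    if not spans:
        return 0.0
    return sum(end - start for start, end, _ in spans) / len(spans)


# Invented patient utterance ("My stomach started hurting yesterday"),
# not drawn from the dataset.
text = "我昨天开始肚子疼"
spans = [(1, 3, "time"), (5, 8, "symptom")]

for char, tag in zip(text, spans_to_bio(text, spans)):
    print(char, tag)
print(mean_entity_length(spans))
```

Tagging per character rather than per word avoids depending on a Chinese word segmenter, which matters for the informal, long multiword expressions the abstract mentions.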
format Online
Article
Text
id pubmed-10655334
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10655334 2023-11-16 A Chinese telemedicine-dialogue dataset annotated for named entities Wang, Shanshan Yan, Yajing Yan, Rong Li, Ting Ma, Kaijie Yan, Yani BMC Med Inform Decis Mak Research BACKGROUND: A large collection of dialogues between patients and doctors must be annotated for medical named entities to build intelligence for telemedicine. However, since most patients involved in telemedicine deliver related named entities in informal and long multiword expressions, it is challenging to tag their telemedicine dialogue data. This study aims to address this issue. METHODS: With the telemedicine dialogue dataset for obstetrics and gynecology taken from haodf.com, we developed guidelines and followed a two-round procedure to tag six types of named entities: disease, symptom, time, pharmaceutical, operation, and examination. Additionally, we developed four deep-learning models based on this dataset to establish a benchmark for named-entity recognition (NER). RESULTS: The distilled obstetrics and gynecology dataset contains 2,383 consultations between doctors and patients, of which 13,411 sentences were from doctors and 17,929 were from patients. With 63,560 named entities in total, the average number of characters per named entity is 4.33. The experimental results suggest that LatticeLSTM performs best on our dataset in terms of accuracy, precision, recall, and F score. CONCLUSION: Compared with other datasets, this dataset offers three novel facets. First, this study offers intricately tagged long multiword expressions for medical named entities. Second, this study is one of the first attempts to mark temporal entities in a medical dataset. Third, this annotated dataset is balanced across the six types of labels, which we believe will play a considerable role in expanding telemedicine artificial intelligence.
BioMed Central 2023-11-16 /pmc/articles/PMC10655334/ /pubmed/37974215 http://dx.doi.org/10.1186/s12911-023-02365-3 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title A Chinese telemedicine-dialogue dataset annotated for named entities
title_sort chinese telemedicine-dialogue dataset annotated for named entities
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10655334/
https://www.ncbi.nlm.nih.gov/pubmed/37974215
http://dx.doi.org/10.1186/s12911-023-02365-3