A Chinese telemedicine-dialogue dataset annotated for named entities
BACKGROUND: A large collection of dialogues between patients and doctors must be annotated for medical named entities to build intelligence for telemedicine. However, since most patients involved in telemedicine deliver related named entities in informal and long multiword expressions, it is challen...
Main authors: | Wang, Shanshan; Yan, Yajing; Yan, Rong; Li, Ting; Ma, Kaijie; Yan, Yani |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2023 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10655334/ https://www.ncbi.nlm.nih.gov/pubmed/37974215 http://dx.doi.org/10.1186/s12911-023-02365-3 |
_version_ | 1785147923780075520 |
---|---|
author | Wang, Shanshan Yan, Yajing Yan, Rong Li, Ting Ma, Kaijie Yan, Yani |
author_sort | Wang, Shanshan |
collection | PubMed |
description | BACKGROUND: A large collection of dialogues between patients and doctors must be annotated for medical named entities to build intelligence for telemedicine. However, since most patients involved in telemedicine deliver related named entities in informal and long multiword expressions, it is challenging to tag their telemedicine dialogue data. This study aims to address this issue. METHODS: With the telemedicine dialogue dataset for obstetrics and gynecology taken from haodf.com, we developed guidelines and followed a two-round procedure to tag six types of named entities, including disease, symptom, time, pharmaceutical, operation, and examination. Additionally, we developed four deep-learning models based on this dataset to establish a benchmark for named-entity recognition (NER). RESULTS: The distilled obstetrics and gynecology dataset contains 2,383 consultations between doctors and patients, of which 13,411 sentences were from doctors, and 17,929 were from patients. With 63,560 named entities in total, the average number of characters per named entity is 4.33. The experimental results suggest that LatticeLSTM performs best on our dataset in terms of accuracy, precision, recall, and F score. CONCLUSION: Compared with other datasets, this dataset offers three novel facets. First, this study offers intricately tagged long multiword expressions for medical named entities. Second, this study is one of the first attempts to mark temporal entities in a medical dataset. Third, this annotated dataset is balanced across the six types of labels, which we believe will play a considerable role in expanding telemedicine artificial intelligence. |
format | Online Article Text |
id | pubmed-10655334 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10655334 2023-11-16 A Chinese telemedicine-dialogue dataset annotated for named entities Wang, Shanshan Yan, Yajing Yan, Rong Li, Ting Ma, Kaijie Yan, Yani BMC Med Inform Decis Mak Research BACKGROUND: A large collection of dialogues between patients and doctors must be annotated for medical named entities to build intelligence for telemedicine. However, since most patients involved in telemedicine deliver related named entities in informal and long multiword expressions, it is challenging to tag their telemedicine dialogue data. This study aims to address this issue. METHODS: With the telemedicine dialogue dataset for obstetrics and gynecology taken from haodf.com, we developed guidelines and followed a two-round procedure to tag six types of named entities, including disease, symptom, time, pharmaceutical, operation, and examination. Additionally, we developed four deep-learning models based on this dataset to establish a benchmark for named-entity recognition (NER). RESULTS: The distilled obstetrics and gynecology dataset contains 2,383 consultations between doctors and patients, of which 13,411 sentences were from doctors, and 17,929 were from patients. With 63,560 named entities in total, the average number of characters per named entity is 4.33. The experimental results suggest that LatticeLSTM performs best on our dataset in terms of accuracy, precision, recall, and F score. CONCLUSION: Compared with other datasets, this dataset offers three novel facets. First, this study offers intricately tagged long multiword expressions for medical named entities. Second, this study is one of the first attempts to mark temporal entities in a medical dataset. Third, this annotated dataset is balanced across the six types of labels, which we believe will play a considerable role in expanding telemedicine artificial intelligence.
BioMed Central 2023-11-16 /pmc/articles/PMC10655334/ /pubmed/37974215 http://dx.doi.org/10.1186/s12911-023-02365-3 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | A Chinese telemedicine-dialogue dataset annotated for named entities |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10655334/ https://www.ncbi.nlm.nih.gov/pubmed/37974215 http://dx.doi.org/10.1186/s12911-023-02365-3 |
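
The description above mentions tagging six types of named entities in Chinese dialogue text and reports an average entity length in characters. The record does not show the dataset's file format or label names, so the following is only a minimal sketch of how character-level span annotations are commonly turned into BIO tags for NER training. The label strings, the `spans_to_bio` and `mean_entity_length` helpers, and the example sentence are all hypothetical and are not taken from the dataset itself.

```python
from typing import List, Tuple

# Hypothetical label names for the six entity types listed in the abstract;
# the dataset's real tag set may be spelled differently.
ENTITY_TYPES = {
    "DISEASE", "SYMPTOM", "TIME",
    "PHARMACEUTICAL", "OPERATION", "EXAMINATION",
}

# A span is (start, end, label) with `end` exclusive, measured in characters.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character-level spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in sorted(spans):
        if label not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {label}")
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        tags[start] = "B-" + label          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining characters
    return tags


def mean_entity_length(spans: List[Span]) -> float:
    """Average number of characters per annotated entity."""
    return sum(end - start for start, end, _ in spans) / len(spans)


# Invented patient utterance: "I have had a stomach ache for the last three days."
text = "我最近三天一直肚子疼"
spans = [(1, 5, "TIME"), (7, 10, "SYMPTOM")]
tags = spans_to_bio(text, spans)

for char, tag in zip(text, tags):
    print(char, tag)
print("mean entity length:", mean_entity_length(spans))
```

Tagging per character rather than per word avoids depending on a Chinese word segmenter, which is one reason character-based schemes are often used for long, informal multiword entities like those the abstract describes.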