A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text
BACKGROUND: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for many Chinese natural language processing (NLP) tasks. There have been a large number of studies on CWS and POS tagging in vario...
Main authors: | Xiong, Ying; Wang, Zhongmin; Jiang, Dehuan; Wang, Xiaolong; Chen, Qingcai; Xu, Hua; Yan, Jun; Tang, Buzhou |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6454584/ https://www.ncbi.nlm.nih.gov/pubmed/30961602 http://dx.doi.org/10.1186/s12911-019-0770-7 |
_version_ | 1783409564816769024 |
---|---|
author | Xiong, Ying Wang, Zhongmin Jiang, Dehuan Wang, Xiaolong Chen, Qingcai Xu, Hua Yan, Jun Tang, Buzhou |
author_facet | Xiong, Ying Wang, Zhongmin Jiang, Dehuan Wang, Xiaolong Chen, Qingcai Xu, Hua Yan, Jun Tang, Buzhou |
author_sort | Xiong, Ying |
collection | PubMed |
description | BACKGROUND: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for many Chinese natural language processing (NLP) tasks. There have been a large number of studies on CWS and POS tagging in various domains; however, few studies have addressed CWS and POS tagging in the clinical domain, as it is not easy to determine the granularity of words. METHODS: In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-granularity level, and manually annotated a corpus. On the corpus, we compared two state-of-the-art methods, i.e., conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer. In order to validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus. RESULTS: When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also gained an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For NER, the CWS information brought a maximum improvement of 0.34% in F-measure, while the CWS&POS information brought a maximum improvement of 0.74% in F-measure. CONCLUSIONS: Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus. |
format | Online Article Text |
id | pubmed-6454584 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6454584 2019-04-17 A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text Xiong, Ying Wang, Zhongmin Jiang, Dehuan Wang, Xiaolong Chen, Qingcai Xu, Hua Yan, Jun Tang, Buzhou BMC Med Inform Decis Mak Research BACKGROUND: Chinese word segmentation (CWS) and part-of-speech (POS) tagging are two fundamental tasks of Chinese text processing. They are usually preliminary steps for many Chinese natural language processing (NLP) tasks. There have been a large number of studies on CWS and POS tagging in various domains; however, few studies have addressed CWS and POS tagging in the clinical domain, as it is not easy to determine the granularity of words. METHODS: In this paper, we investigated CWS and POS tagging for Chinese clinical text at a fine-granularity level, and manually annotated a corpus. On the corpus, we compared two state-of-the-art methods, i.e., conditional random fields (CRF) and bidirectional long short-term memory (BiLSTM) with a CRF layer. In order to validate the plausibility of the fine-grained annotation, we further investigated the effect of CWS and POS tagging on Chinese clinical named entity recognition (NER) on another independent corpus. RESULTS: When only CWS was considered, CRF achieved higher precision, recall and F-measure than BiLSTM-CRF. When both CWS and POS tagging were considered, CRF also gained an advantage over BiLSTM-CRF. CRF outperformed BiLSTM-CRF by 0.14% in F-measure on CWS and by 0.34% in F-measure on POS tagging. For NER, the CWS information brought a maximum improvement of 0.34% in F-measure, while the CWS&POS information brought a maximum improvement of 0.74% in F-measure. CONCLUSIONS: Our proposed fine-grained CWS and POS tagging corpus is reliable and meaningful, as the output of the CWS and POS tagging systems developed on this corpus improved the performance of a Chinese clinical NER system on another independent corpus.
BioMed Central 2019-04-09 /pmc/articles/PMC6454584/ /pubmed/30961602 http://dx.doi.org/10.1186/s12911-019-0770-7 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Xiong, Ying Wang, Zhongmin Jiang, Dehuan Wang, Xiaolong Chen, Qingcai Xu, Hua Yan, Jun Tang, Buzhou A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text |
title | A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text |
title_full | A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text |
title_fullStr | A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text |
title_full_unstemmed | A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text |
title_short | A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text |
title_sort | fine-grained chinese word segmentation and part-of-speech tagging corpus for clinical text |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6454584/ https://www.ncbi.nlm.nih.gov/pubmed/30961602 http://dx.doi.org/10.1186/s12911-019-0770-7 |
work_keys_str_mv | AT xiongying afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT wangzhongmin afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT jiangdehuan afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT wangxiaolong afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT chenqingcai afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT xuhua afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT yanjun afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT tangbuzhou afinegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT xiongying finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT wangzhongmin finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT jiangdehuan finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT wangxiaolong finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT chenqingcai finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT xuhua finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT yanjun finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext AT tangbuzhou finegrainedchinesewordsegmentationandpartofspeechtaggingcorpusforclinicaltext |
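
Note on the metrics named in the description: the abstract reports precision, recall and F-measure for CWS but this record does not say how they were computed. The sketch below shows one common word-level scoring scheme, in which a predicted word counts as correct only if its character span exactly matches a gold word. The function names and the example sentence are hypothetical and are not taken from the article or its corpus.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F-measure for one segmented sentence.

    Assumption: both segmentations cover the same underlying character string.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words with identical boundaries
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical example: the prediction over-splits the last gold word.
    gold = ["患者", "无", "发热"]
    predicted = ["患者", "无", "发", "热"]
    p, r, f = segmentation_prf(gold, predicted)
    print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "0.500 0.667 0.571"
```

Scoring POS tagging jointly with CWS can follow the same pattern by adding the tag to each span tuple, so that a word is correct only when both its boundaries and its tag match.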