Chinese text classification by combining Chinese-BERTology-wwm and GCN
Text classification is an important and classic application in natural language processing (NLP). Recent studies have shown that graph neural networks (GNNs) are effective in tasks with rich structural relationships and serve as effective transductive learning approaches. Text representation learning…
Main Authors: | Xu, Xue; Chang, Yu; An, Jianye; Du, Yongqiang |
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc., 2023 |
Subjects: | Artificial Intelligence |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10495955/ https://www.ncbi.nlm.nih.gov/pubmed/37705631 http://dx.doi.org/10.7717/peerj-cs.1544 |
_version_ | 1785105004252626944 |
author | Xu, Xue; Chang, Yu; An, Jianye; Du, Yongqiang |
author_facet | Xu, Xue; Chang, Yu; An, Jianye; Du, Yongqiang |
author_sort | Xu, Xue |
collection | PubMed |
description | Text classification is an important and classic application in natural language processing (NLP). Recent studies have shown that graph neural networks (GNNs) are effective in tasks with rich structural relationships and serve as effective transductive learning approaches. Text representation learning methods based on large-scale pretraining can learn implicit but rich semantic information from text. However, few studies have comprehensively utilized both the contextual semantic and the structural information for Chinese text classification. Moreover, existing GNN methods for text classification have not considered the applicability of their graph construction methods to long or short texts. In this work, we propose Chinese-BERTology-wwm-GCN, a framework that combines Chinese bidirectional encoder representations from transformers (BERT) series models with whole word masking (Chinese-BERTology-wwm) and the graph convolutional network (GCN) for Chinese text classification. When building the text graph, we use documents and words as nodes to construct a heterogeneous graph for the entire corpus. Specifically, we use term frequency-inverse document frequency (TF-IDF) to construct the word-document edge weights. For long text corpora, we propose an improved pointwise mutual information (PMI*) measure that weights word pairs according to their co-occurrence distances to represent the weights of word-word edges. For short text corpora, the co-occurrence information between words is often limited; therefore, we utilize cosine similarity to represent the word-word edge weights. During the training stage, we combine the cross-entropy and hinge losses and use them to jointly train Chinese-BERTology-wwm and the GCN. Experiments show that our proposed framework significantly outperforms the baselines on three Chinese benchmark datasets and achieves good performance even with few labeled training examples. (Minimal code sketches of this graph construction and the joint loss follow the record fields below.) |
format | Online Article Text |
id | pubmed-10495955 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-10495955 2023-09-13 Chinese text classification by combining Chinese-BERTology-wwm and GCN Xu, Xue; Chang, Yu; An, Jianye; Du, Yongqiang PeerJ Comput Sci Artificial Intelligence Text classification is an important and classic application in natural language processing (NLP). Recent studies have shown that graph neural networks (GNNs) are effective in tasks with rich structural relationships and serve as effective transductive learning approaches. Text representation learning methods based on large-scale pretraining can learn implicit but rich semantic information from text. However, few studies have comprehensively utilized both the contextual semantic and the structural information for Chinese text classification. Moreover, existing GNN methods for text classification have not considered the applicability of their graph construction methods to long or short texts. In this work, we propose Chinese-BERTology-wwm-GCN, a framework that combines Chinese bidirectional encoder representations from transformers (BERT) series models with whole word masking (Chinese-BERTology-wwm) and the graph convolutional network (GCN) for Chinese text classification. When building the text graph, we use documents and words as nodes to construct a heterogeneous graph for the entire corpus. Specifically, we use term frequency-inverse document frequency (TF-IDF) to construct the word-document edge weights. For long text corpora, we propose an improved pointwise mutual information (PMI*) measure that weights word pairs according to their co-occurrence distances to represent the weights of word-word edges. For short text corpora, the co-occurrence information between words is often limited; therefore, we utilize cosine similarity to represent the word-word edge weights. During the training stage, we combine the cross-entropy and hinge losses and use them to jointly train Chinese-BERTology-wwm and the GCN. Experiments show that our proposed framework significantly outperforms the baselines on three Chinese benchmark datasets and achieves good performance even with few labeled training examples. PeerJ Inc. 2023-08-17 /pmc/articles/PMC10495955/ /pubmed/37705631 http://dx.doi.org/10.7717/peerj-cs.1544 Text en © 2023 Xu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Artificial Intelligence; Xu, Xue; Chang, Yu; An, Jianye; Du, Yongqiang; Chinese text classification by combining Chinese-BERTology-wwm and GCN |
title | Chinese text classification by combining Chinese-BERTology-wwm and GCN |
title_full | Chinese text classification by combining Chinese-BERTology-wwm and GCN |
title_fullStr | Chinese text classification by combining Chinese-BERTology-wwm and GCN |
title_full_unstemmed | Chinese text classification by combining Chinese-BERTology-wwm and GCN |
title_short | Chinese text classification by combining Chinese-BERTology-wwm and GCN |
title_sort | chinese text classification by combining chinese-bertology-wwm and gcn |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10495955/ https://www.ncbi.nlm.nih.gov/pubmed/37705631 http://dx.doi.org/10.7717/peerj-cs.1544 |
work_keys_str_mv | AT xuxue chinesetextclassificationbycombiningchinesebertologywwmandgcn AT changyu chinesetextclassificationbycombiningchinesebertologywwmandgcn AT anjianye chinesetextclassificationbycombiningchinesebertologywwmandgcn AT duyongqiang chinesetextclassificationbycombiningchinesebertologywwmandgcn |
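
The abstract above outlines a TextGCN-style corpus graph: document and word nodes, TF-IDF weights on word-document edges, PMI-based weights on word-word edges for long texts, and cosine similarity for short texts. The sketch below shows a standard sliding-window construction under those assumptions; `build_text_graph` and the window size are illustrative stand-ins, and plain positive PMI is used because the paper's distance-aware PMI* and its cosine-similarity variant are not detailed in this record.

```python
# Minimal sketch of TextGCN-style heterogeneous graph construction,
# assuming pre-segmented Chinese documents (lists of word tokens).
# Node order in the adjacency matrix: [documents | vocabulary words].
import math
from collections import Counter

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

def build_text_graph(docs, window=20):
    """docs: list of token lists. Returns the adjacency matrix (CSR)."""
    # Word-document edges: TF-IDF weights, as described in the abstract.
    tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
    X = tfidf.fit_transform(docs)                  # (n_docs, n_words)
    vocab = tfidf.vocabulary_
    n_docs, n_words = X.shape

    # Word-word edges: PMI over sliding windows (standard TextGCN weighting;
    # the paper instead uses a distance-aware PMI* for long texts and cosine
    # similarity for short texts, neither of which is specified here).
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for tokens in docs:
        for start in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[start:start + window])
            n_windows += 1
            word_count.update(win)                 # each word once per window
            win = sorted(win)
            for i in range(len(win)):
                for j in range(i + 1, len(win)):
                    pair_count[(win[i], win[j])] += 1

    rows, cols, vals = [], [], []
    for (wi, wj), cij in pair_count.items():
        # PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ) over window counts.
        pmi = math.log(cij * n_windows / (word_count[wi] * word_count[wj]))
        if pmi > 0:                                # keep positive PMI only
            a, b = vocab[wi], vocab[wj]
            rows += [a, b]; cols += [b, a]; vals += [pmi, pmi]
    W = sp.csr_matrix((vals, (rows, cols)), shape=(n_words, n_words))

    # Assemble the block adjacency with self-loops, as GCNs expect.
    return sp.bmat([[sp.identity(n_docs), X],
                    [X.T, W + sp.identity(n_words)]], format="csr")
```

The resulting matrix can be normalized and fed to any off-the-shelf GCN layer; the two-block layout keeps document rows first, so the trained model's first `n_docs` output rows are the document predictions.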
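The abstract also states that cross-entropy and hinge losses are combined to jointly train Chinese-BERTology-wwm and the GCN. Below is a minimal sketch of one such joint objective, assuming the two models' logits are linearly interpolated; the weights `lam` and `beta` are hypothetical hyperparameters, not values from the paper.

```python
# Hedged sketch: cross-entropy plus multi-class hinge loss over
# interpolated BERT/GCN logits. `lam` and `beta` are illustrative
# assumptions, not the paper's reported settings.
import torch.nn.functional as F

def joint_loss(bert_logits, gcn_logits, labels, lam=0.7, beta=0.5):
    """bert_logits, gcn_logits: (batch, n_classes); labels: (batch,) int64."""
    logits = lam * bert_logits + (1.0 - lam) * gcn_logits
    ce = F.cross_entropy(logits, labels)           # cross-entropy term
    hinge = F.multi_margin_loss(logits, labels)    # multi-class hinge term
    return ce + beta * hinge
```

Backpropagating this single scalar through both sets of logits trains the BERT encoder and the GCN jointly, which matches the training-stage description in the abstract.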