
A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification

BACKGROUND: Given the threat posed by cancer to human health, there is a rapid growth in the volume of data in the cancer field and interdisciplinary and collaborative research is becoming increasingly important for fine-grained classification. The low-resolution classifier of reported studies at th...


Bibliographic Details
Main Authors:	Zhang, Ying; Li, Xiaoying; Liu, Yi; Li, Aihua; Yang, Xuemei; Tang, Xiaoli
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10587805/ https://www.ncbi.nlm.nih.gov/pubmed/37796584 http://dx.doi.org/10.2196/44892

author	Zhang, Ying; Li, Xiaoying; Liu, Yi; Li, Aihua; Yang, Xuemei; Tang, Xiaoli
author_sort	Zhang, Ying
collection	PubMed
description	BACKGROUND: Given the threat that cancer poses to human health, the volume of data in the cancer field is growing rapidly, and interdisciplinary and collaborative research is making fine-grained classification increasingly important. The low-resolution, journal-level classifier used in reported studies fails to satisfy advanced search demands, and a single label does not adequately characterize literature originating from interdisciplinary research. There is thus a need for a higher-resolution multilabel classifier to support literature retrieval for cancer research and reduce the burden of screening papers for clinical relevance. OBJECTIVE: The primary objective of this research was to address the low resolution of cancer literature classification caused by the ambiguity of the existing journal-level classifier, in order to help obtain highly relevant evidence for clinical consideration and comprehensive results in literature retrieval. METHODS: We trained a scalable multilabel classifier that classifies cancer research literature directly at the publication level and assigns appropriate content-derived labels, based on the “Bidirectional Encoder Representations from Transformers (BERT) + X” model, and identified the best option for X. First, a corpus of 70,599 cancer publications retrieved from the Dimensions database was divided into a training set and a testing set in a ratio of 7:3. Second, using the classification terminology of the International Cancer Research Partnership cancer types, we compared the performance of classifiers developed using BERT and 5 classical deep learning models, such as the text recurrent neural network (TextRNN) and FastText, followed by metrics analysis. RESULTS: After comparing various combined deep learning models, we obtained a classifier based on the optimal combination “BERT + TextRNN,” with a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34%. Moreover, we quantified the distinctive characteristics of the text structure and multilabel distribution in order to generalize the model to other fields with similar characteristics. CONCLUSIONS: The “BERT + TextRNN” model was trained for high-resolution classification of cancer literature at the publication level to support accurate retrieval and academic statistics. The model automatically assigns 1 or more labels to each cancer paper, as required. Quantitative comparison verified that the “BERT + TextRNN” model is the best fit for multilabel classification of cancer literature compared to other models. In the future, more data from diverse fields will be collected to test the scalability and extensibility of the proposed model.
format	Online Article Text
id	pubmed-10587805
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-10587805 2023-10-21 A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification Zhang, Ying Li, Xiaoying Liu, Yi Li, Aihua Yang, Xuemei Tang, Xiaoli JMIR Med Inform Original Paper JMIR Publications 2023-10-05 /pmc/articles/PMC10587805/ /pubmed/37796584 http://dx.doi.org/10.2196/44892 Text en ©Ying Zhang, Xiaoying Li, Yi Liu, Aihua Li, Xuemei Yang, Xiaoli Tang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 05.10.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title	A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10587805/ https://www.ncbi.nlm.nih.gov/pubmed/37796584 http://dx.doi.org/10.2196/44892
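The abstract in this record reports a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34% for a multilabel classifier. The sketch below shows one common way such figures can be computed over per-paper label sets, and checks that the reported F1-score is the harmonic mean of the reported precision and recall. It is a minimal illustration, not the authors' evaluation code: the micro-averaging scheme is an assumption (the abstract does not say how the metrics were averaged), and the function names and toy labels are hypothetical.

```python
from typing import Iterable, Set, Tuple


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(
    gold: Iterable[Set[str]], pred: Iterable[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-paper label sets.

    Assumption: label decisions are pooled across all papers before
    computing the ratios; the source does not state the averaging used.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but not in the gold set
        fn += len(g - p)  # gold labels the classifier missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy data with made-up labels: each paper may carry one or more labels.
gold = [{"breast"}, {"lung", "colorectal"}, {"leukemia"}]
pred = [{"breast"}, {"lung"}, {"leukemia", "lung"}]
p, r, f = micro_prf(gold, pred)
print(f"toy precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Consistency check on the figures reported in the abstract.
reported_f1 = f1_from(0.9309, 0.8775)
print(f"F1 implied by reported precision and recall: {reported_f1 * 100:.2f}%")
```

On the reported figures, the harmonic mean of 93.09% and 87.75% rounds to 90.34%, which agrees with the F1-score given in the abstract.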

