
Transformers-sklearn: a toolkit for medical language understanding with transformer-based models


Bibliographic Details
Main Authors: Yang, Feihong, Wang, Xuwen, Ma, Hetong, Li, Jiao
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323195/
https://www.ncbi.nlm.nih.gov/pubmed/34330244
http://dx.doi.org/10.1186/s12911-021-01459-0
author Yang, Feihong
Wang, Xuwen
Ma, Hetong
Li, Jiao
author_facet Yang, Feihong
Wang, Xuwen
Ma, Hetong
Li, Jiao
author_sort Yang, Feihong
collection PubMed
description BACKGROUND: The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of getting started with transformer-based models in medical language understanding and to expand the capability of the scikit-learn toolkit in deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. METHODS: In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is generated automatically by transformers-sklearn from the annotated corpus; newcomers only need to prepare the dataset, as the model framework and training methods are predefined in transformers-sklearn. RESULTS: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines per task, one-sixth the size of the corresponding transformers scripts. The experimental results show that transformers-sklearn, based on pretrained BERT models, achieved macro F1 scores of 0.8225, 0.8703, and 0.6908 on the TrialClassification, BC5CDR, and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, consistent with the results of transformers. CONCLUSIONS: The proposed toolkit could help newcomers easily address medical language understanding tasks in the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn.
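
A minimal sketch of the workflow the abstract describes, in the scikit-learn coding style the toolkit adopts. The class BERTologyClassifier, its fit/score/predict methods, and the model_name_or_path and model_type parameters are all named in the abstract; the import path, the exact constructor signature, and the list-of-strings data format are assumptions for illustration only.

from transformers_sklearn import BERTologyClassifier  # import path assumed

# Hypothetical toy corpus; the paper's experiments used datasets such as
# TrialClassification and BC5CDR instead.
X_train = ["patient reports chest pain", "routine follow-up visit"]
y_train = ["cardiology", "general"]
X_test = ["acute chest discomfort"]

# Customize via the parameters the abstract names; exact signature assumed.
clf = BERTologyClassifier(
    model_type="bert",
    model_name_or_path="bert-base-uncased",
)

clf.fit(X_train, y_train)           # fine-tune the pretrained model on the training set
print(clf.score(X_train, y_train))  # evaluate the fine-tuned model
print(clf.predict(X_test))          # predict labels for unseen text

Per the abstract, the same three-method pattern applies to BERTologyNERClassifier for NER and BERTologyRegressor for regression tasks such as sentence similarity estimation.
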
format Online
Article
Text
id pubmed-8323195
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8323195 2021-07-30 Transformers-sklearn: a toolkit for medical language understanding with transformer-based models Yang, Feihong Wang, Xuwen Ma, Hetong Li, Jiao BMC Med Inform Decis Mak Software BACKGROUND: The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of getting started with transformer-based models in medical language understanding and to expand the capability of the scikit-learn toolkit in deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. METHODS: In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is generated automatically by transformers-sklearn from the annotated corpus; newcomers only need to prepare the dataset, as the model framework and training methods are predefined in transformers-sklearn. RESULTS: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines per task, one-sixth the size of the corresponding transformers scripts. The experimental results show that transformers-sklearn, based on pretrained BERT models, achieved macro F1 scores of 0.8225, 0.8703, and 0.6908 on the TrialClassification, BC5CDR, and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, consistent with the results of transformers. CONCLUSIONS: The proposed toolkit could help newcomers easily address medical language understanding tasks in the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn. BioMed Central 2021-07-30 /pmc/articles/PMC8323195/ /pubmed/34330244 http://dx.doi.org/10.1186/s12911-021-01459-0 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Yang, Feihong
Wang, Xuwen
Ma, Hetong
Li, Jiao
Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_full Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_fullStr Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_full_unstemmed Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_short Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_sort transformers-sklearn: a toolkit for medical language understanding with transformer-based models
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323195/
https://www.ncbi.nlm.nih.gov/pubmed/34330244
http://dx.doi.org/10.1186/s12911-021-01459-0
work_keys_str_mv AT yangfeihong transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels
AT wangxuwen transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels
AT mahetong transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels
AT lijiao transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels