
Transformers-sklearn: a toolkit for medical language understanding with transformer-based models


Bibliographic Details
Main Authors: Yang, Feihong, Wang, Xuwen, Ma, Hetong, Li, Jiao
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323195/
https://www.ncbi.nlm.nih.gov/pubmed/34330244
http://dx.doi.org/10.1186/s12911-021-01459-0
author Yang, Feihong
Wang, Xuwen
Ma, Hetong
Li, Jiao
author_facet Yang, Feihong
Wang, Xuwen
Ma, Hetong
Li, Jiao
author_sort Yang, Feihong
collection PubMed
description BACKGROUND: The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of getting started with transformer-based models in medical language understanding and to expand the capability of the scikit-learn toolkit in deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. METHODS: In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is generated automatically by transformers-sklearn from the annotated corpus; newcomers only need to prepare the dataset, as the model framework and training methods are predefined in transformers-sklearn. RESULTS: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines per task, one-sixth the size of the corresponding transformers scripts. The experimental results show that transformers-sklearn, based on pretrained BERT models, achieved macro F1 scores of 0.8225, 0.8703, and 0.6908 on the TrialClassification, BC5CDR, and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, consistent with the results of transformers. CONCLUSIONS: The proposed toolkit could help newcomers easily address medical language understanding tasks in the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn.
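
A minimal sketch of the workflow the abstract describes, in the scikit-learn coding style the toolkit adopts. The class BERTologyClassifier, its fit/score/predict methods, and the model_name_or_path and model_type parameters are all named in the abstract; the import path, the exact constructor signature, and the list-of-strings data format are assumptions for illustration only.

from transformers_sklearn import BERTologyClassifier  # import path assumed

# Hypothetical toy corpus; the paper's experiments used datasets such as
# TrialClassification and BC5CDR instead.
X_train = ["patient reports chest pain", "routine follow-up visit"]
y_train = ["cardiology", "general"]
X_test = ["acute chest discomfort"]

# Customize via the parameters the abstract names; exact signature assumed.
clf = BERTologyClassifier(
    model_type="bert",
    model_name_or_path="bert-base-uncased",
)

clf.fit(X_train, y_train)           # fine-tune the pretrained model on the training set
print(clf.score(X_train, y_train))  # evaluate the fine-tuned model
print(clf.predict(X_test))          # predict labels for unseen text

Per the abstract, the same three-method pattern applies to BERTologyNERClassifier for NER and BERTologyRegressor for regression tasks such as sentence similarity estimation.
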
format Online
Article
Text
id pubmed-8323195
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8323195 2021-07-30 Transformers-sklearn: a toolkit for medical language understanding with transformer-based models Yang, Feihong Wang, Xuwen Ma, Hetong Li, Jiao BMC Med Inform Decis Mak Software BACKGROUND: The Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To reduce the difficulty of getting started with transformer-based models in medical language understanding and to expand the capability of the scikit-learn toolkit in deep learning, we proposed an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. METHODS: In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is generated automatically by transformers-sklearn from the annotated corpus; newcomers only need to prepare the dataset, as the model framework and training methods are predefined in transformers-sklearn. RESULTS: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. Across the four medical NLP tasks, the average code size of our scripts is 45 lines per task, one-sixth the size of the corresponding transformers scripts. The experimental results show that transformers-sklearn, based on pretrained BERT models, achieved macro F1 scores of 0.8225, 0.8703, and 0.6908 on the TrialClassification, BC5CDR, and DiabetesNER tasks, respectively, and a Pearson correlation of 0.8260 on the BIOSSES task, consistent with the results of transformers. CONCLUSIONS: The proposed toolkit could help newcomers easily address medical language understanding tasks in the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to broaden the applications of transformers-sklearn. BioMed Central 2021-07-30 /pmc/articles/PMC8323195/ /pubmed/34330244 http://dx.doi.org/10.1186/s12911-021-01459-0 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Software
Yang, Feihong
Wang, Xuwen
Ma, Hetong
Li, Jiao
Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_full Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_fullStr Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_full_unstemmed Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_short Transformers-sklearn: a toolkit for medical language understanding with transformer-based models
title_sort transformers-sklearn: a toolkit for medical language understanding with transformer-based models
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8323195/
https://www.ncbi.nlm.nih.gov/pubmed/34330244
http://dx.doi.org/10.1186/s12911-021-01459-0
work_keys_str_mv AT yangfeihong transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels
AT wangxuwen transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels
AT mahetong transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels
AT lijiao transformerssklearnatoolkitformedicallanguageunderstandingwithtransformerbasedmodels