Fast and scalable neural embedding models for biomedical sentence classification
BACKGROUND: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, a multitude of different approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetoric categories (e.g., Background, M...
Main authors: | Agibetov, Asan; Blagec, Kathrin; Xu, Hong; Samwald, Matthias |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6303852/ https://www.ncbi.nlm.nih.gov/pubmed/30577747 http://dx.doi.org/10.1186/s12859-018-2496-4 |
_version_ | 1783382241096761344 |
---|---|
author | Agibetov, Asan Blagec, Kathrin Xu, Hong Samwald, Matthias |
author_facet | Agibetov, Asan Blagec, Kathrin Xu, Hong Samwald, Matthias |
author_sort | Agibetov, Asan |
collection | PubMed |
description | BACKGROUND: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, a multitude of different approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetoric categories (e.g., Background, Methods, Results, Conclusions) have been devised, with recent state-of-the-art results reported for a complex deep learning model. Recent evidence showed that shallow and wide neural models such as fastText can provide results that are competitive or superior to complex deep learning models while requiring drastically lower training times and having better scalability. We analyze the efficacy of the fastText model in the classification of biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText on sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited. RESULTS: Our fastText-based methodology yields a state-of-the-art F1 score of 0.917 on the PubMed 200k benchmark when sentence ordering is taken into account, with a training time of only 73 s on standard hardware. Applying fastText on single sentences, without taking sentence ordering into account, yielded an F1 score of 0.852 (training time 13 s). Unsupervised pre-training of N-gram vectors greatly improved the results for small training set sizes, with an increase of F1 score of 0.21 to 0.74 when trained on only 1000 randomly picked sentences without taking sentence ordering into account. CONCLUSIONS: Because of its ease of use and performance, fastText should be among the first choices of tools when tackling biomedical text classification problems with large corpora. Unsupervised pre-training of N-gram vectors on domain-specific corpora also makes it possible to apply fastText when labeled training data are limited. |
format | Online Article Text |
id | pubmed-6303852 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6303852 2018-12-31 Fast and scalable neural embedding models for biomedical sentence classification Agibetov, Asan Blagec, Kathrin Xu, Hong Samwald, Matthias BMC Bioinformatics Research Article BACKGROUND: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, a multitude of different approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetoric categories (e.g., Background, Methods, Results, Conclusions) have been devised, with recent state-of-the-art results reported for a complex deep learning model. Recent evidence showed that shallow and wide neural models such as fastText can provide results that are competitive or superior to complex deep learning models while requiring drastically lower training times and having better scalability. We analyze the efficacy of the fastText model in the classification of biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText on sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited. RESULTS: Our fastText-based methodology yields a state-of-the-art F1 score of 0.917 on the PubMed 200k benchmark when sentence ordering is taken into account, with a training time of only 73 s on standard hardware. Applying fastText on single sentences, without taking sentence ordering into account, yielded an F1 score of 0.852 (training time 13 s). Unsupervised pre-training of N-gram vectors greatly improved the results for small training set sizes, with an increase of F1 score of 0.21 to 0.74 when trained on only 1000 randomly picked sentences without taking sentence ordering into account. CONCLUSIONS: Because of its ease of use and performance, fastText should be among the first choices of tools when tackling biomedical text classification problems with large corpora. Unsupervised pre-training of N-gram vectors on domain-specific corpora also makes it possible to apply fastText when labeled training data are limited. BioMed Central 2018-12-22 /pmc/articles/PMC6303852/ /pubmed/30577747 http://dx.doi.org/10.1186/s12859-018-2496-4 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Agibetov, Asan Blagec, Kathrin Xu, Hong Samwald, Matthias Fast and scalable neural embedding models for biomedical sentence classification |
title | Fast and scalable neural embedding models for biomedical sentence classification |
title_full | Fast and scalable neural embedding models for biomedical sentence classification |
title_fullStr | Fast and scalable neural embedding models for biomedical sentence classification |
title_full_unstemmed | Fast and scalable neural embedding models for biomedical sentence classification |
title_short | Fast and scalable neural embedding models for biomedical sentence classification |
title_sort | fast and scalable neural embedding models for biomedical sentence classification |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6303852/ https://www.ncbi.nlm.nih.gov/pubmed/30577747 http://dx.doi.org/10.1186/s12859-018-2496-4 |
work_keys_str_mv | AT agibetovasan fastandscalableneuralembeddingmodelsforbiomedicalsentenceclassification AT blageckathrin fastandscalableneuralembeddingmodelsforbiomedicalsentenceclassification AT xuhong fastandscalableneuralembeddingmodelsforbiomedicalsentenceclassification AT samwaldmatthias fastandscalableneuralembeddingmodelsforbiomedicalsentenceclassification |