Fast and scalable neural embedding models for biomedical sentence classification

BACKGROUND: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, a multitude of approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetorical categories (e.g., Background, Methods, Results, Conclusions) have been devised, with recent state-of-the-art results reported for a complex deep learning model. Recent evidence has shown that shallow, wide neural models such as fastText can provide results that are competitive with or superior to those of complex deep learning models while requiring drastically lower training times and offering better scalability. We analyze the efficacy of the fastText model in the classification of biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText to sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited.
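The fastText classifier referred to above treats each input line as a bag of word and character n-grams and trains a linear classifier over their averaged embeddings. The following is a minimal sketch of the single-sentence setting using the fasttext Python bindings; the file names, example sentences, and hyperparameter values are illustrative assumptions, not the configuration reported in the paper. Training data must contain one sentence per line, prefixed with a __label__ tag for its rhetorical category (the PubMed RCT benchmark uses BACKGROUND, OBJECTIVE, METHODS, RESULTS, and CONCLUSIONS).

    import fasttext

    # Expected training file format (one example per line), e.g.:
    #   __label__METHODS patients were randomly assigned to treatment or placebo .
    # "train.txt", "test.txt" and the hyperparameters are placeholder assumptions.
    model = fasttext.train_supervised(
        input="train.txt",
        dim=100,       # embedding dimensionality
        wordNgrams=2,  # add bigram features on top of unigrams
        epoch=10,
        lr=0.5,
    )

    # Classify a single sentence, ignoring its position in the abstract.
    labels, probs = model.predict("the primary outcome was overall survival at five years .")
    print(labels[0], probs[0])

    # Evaluate on a held-out file in the same format: returns (N, precision@1, recall@1).
    print(model.test("test.txt"))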

RESULTS: Our fastText-based methodology yields a state-of-the-art F1 score of 0.917 on the PubMed 200k benchmark when sentence ordering is taken into account, with a training time of only 73 s on standard hardware. Applying fastText to single sentences, without taking sentence ordering into account, yielded an F1 score of 0.852 (training time: 13 s). Unsupervised pre-training of N-gram vectors greatly improved the results for small training set sizes, with an increase in F1 score of 0.21, to 0.74, when trained on only 1000 randomly picked sentences without taking sentence ordering into account.

CONCLUSIONS: Because of its ease of use and performance, fastText should be among the first choices of tools when tackling biomedical text classification problems with large corpora. Unsupervised pre-training of N-gram vectors on domain-specific corpora also makes it possible to apply fastText when labeled training data are limited.
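The 0.917 result above depends on the pre-processing step that encodes sentence ordering; the exact scheme is detailed in the paper itself. Purely as an illustration of the general idea: since fastText sees each input line as an unordered bag of n-grams, sequence information has to be injected into the tokens themselves, for example by prepending a coarse position marker and tokens from neighboring sentences. The helper below and its bucketing scheme are hypothetical, not the authors' method.

    # Hypothetical context-injection helper; the paper's actual pre-processing
    # differs in its details. Each output line can be labeled and fed to
    # train_supervised exactly as in the previous sketch.
    def with_context(sentences, n_buckets=10):
        """Yield fastText-ready token strings for the sentences of one abstract."""
        n = len(sentences)
        for i, sent in enumerate(sentences):
            # Coarse relative-position token, e.g. __pos_0__ ... __pos_9__
            pos = f"__pos_{min(i * n_buckets // max(n, 1), n_buckets - 1)}__"
            prev_toks = sentences[i - 1] if i > 0 else ""
            next_toks = sentences[i + 1] if i < n - 1 else ""
            yield " ".join(p for p in (pos, prev_toks, sent, next_toks) if p)

    abstract = [
        "diabetes is a growing public health problem .",
        "we randomized 200 patients to metformin or placebo .",
        "hba1c decreased significantly in the treatment arm .",
    ]
    for line in with_context(abstract):
        print(line)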
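The low-resource gains reported in the RESULTS section come from initializing the classifier with N-gram vectors pre-trained on unlabeled in-domain text. Below is a minimal sketch of that workflow, again with the fasttext Python bindings; the corpus and file names are placeholders. One practical detail: train_supervised accepts pre-trained vectors via its pretrainedVectors argument, which expects a textual .vec file, so the vectors are exported in that format first, and the dim values must match.

    import fasttext

    # 1) Pre-train subword-aware vectors on an unlabeled biomedical corpus
    #    ("pubmed_corpus.txt" is a placeholder: one pre-processed sentence per line).
    vec_model = fasttext.train_unsupervised(
        input="pubmed_corpus.txt",
        model="skipgram",
        dim=100,  # must match the dim passed to train_supervised below
    )

    # 2) Export the vectors in the textual .vec format expected by
    #    train_supervised's pretrainedVectors argument.
    words = vec_model.get_words()
    with open("pubmed_vectors.vec", "w", encoding="utf-8") as f:
        f.write(f"{len(words)} {vec_model.get_dimension()}\n")
        for w in words:
            row = " ".join(f"{x:.5f}" for x in vec_model.get_word_vector(w))
            f.write(f"{w} {row}\n")

    # 3) Train the classifier on a small labeled set (e.g. ~1000 sentences),
    #    starting from the pre-trained vectors instead of a random initialization.
    clf = fasttext.train_supervised(
        input="small_train.txt",
        dim=100,
        pretrainedVectors="pubmed_vectors.vec",
    )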

Bibliographic Details
Main Authors: Agibetov, Asan; Blagec, Kathrin; Xu, Hong; Samwald, Matthias
Format: Online Article (Text)
Language: English
Published: BioMed Central, 2018-12-22
Journal: BMC Bioinformatics
Subjects: Research Article
Collection: PubMed (National Center for Biotechnology Information), record format MEDLINE/PubMed
Online Access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6303852/
https://www.ncbi.nlm.nih.gov/pubmed/30577747
http://dx.doi.org/10.1186/s12859-018-2496-4
License: © The Author(s) 2018. Open Access: this article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.