Fast and scalable neural embedding models for biomedical sentence classification

BACKGROUND: Biomedical literature is expanding rapidly, and tools that help locate information of interest are needed. To this end, a multitude of approaches for classifying sentences in biomedical publications according to their coarse semantic and rhetorical categories (e.g., Background, Methods, Results, Conclusions) have been devised, with recent state-of-the-art results reported for a complex deep learning model. Recent evidence has shown that shallow, wide neural models such as fastText can provide results that are competitive with or superior to those of complex deep learning models while requiring drastically lower training times and offering better scalability. We analyze the efficacy of the fastText model in the classification of biomedical sentences in the PubMed 200k RCT benchmark, and introduce a simple pre-processing step that enables the application of fastText to sentence sequences. Furthermore, we explore the utility of two unsupervised pre-training approaches in scenarios where labeled training data are limited.
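The fastText classifier referred to above treats each input line as a bag of word and character n-grams and trains a linear classifier over their averaged embeddings. The following is a minimal sketch of the single-sentence setting using the fasttext Python bindings; the file names, example sentences, and hyperparameter values are illustrative assumptions, not the configuration reported in the paper. Training data must contain one sentence per line, prefixed with a __label__ tag for its rhetorical category (the PubMed RCT benchmark uses BACKGROUND, OBJECTIVE, METHODS, RESULTS, and CONCLUSIONS).

    import fasttext

    # Expected training file format (one example per line), e.g.:
    #   __label__METHODS patients were randomly assigned to treatment or placebo .
    # "train.txt", "test.txt" and the hyperparameters are placeholder assumptions.
    model = fasttext.train_supervised(
        input="train.txt",
        dim=100,       # embedding dimensionality
        wordNgrams=2,  # add bigram features on top of unigrams
        epoch=10,
        lr=0.5,
    )

    # Classify a single sentence, ignoring its position in the abstract.
    labels, probs = model.predict("the primary outcome was overall survival at five years .")
    print(labels[0], probs[0])

    # Evaluate on a held-out file in the same format: returns (N, precision@1, recall@1).
    print(model.test("test.txt"))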

RESULTS: Our fastText-based methodology yields a state-of-the-art F1 score of 0.917 on the PubMed 200k benchmark when sentence ordering is taken into account, with a training time of only 73 s on standard hardware. Applying fastText to single sentences, without taking sentence ordering into account, yielded an F1 score of 0.852 (training time: 13 s). Unsupervised pre-training of N-gram vectors greatly improved the results for small training set sizes, with an increase in F1 score of 0.21, to 0.74, when trained on only 1000 randomly picked sentences without taking sentence ordering into account.

CONCLUSIONS: Because of its ease of use and performance, fastText should be among the first choices of tools when tackling biomedical text classification problems with large corpora. Unsupervised pre-training of N-gram vectors on domain-specific corpora also makes it possible to apply fastText when labeled training data are limited.
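The 0.917 result above depends on the pre-processing step that encodes sentence ordering; the exact scheme is detailed in the paper itself. Purely as an illustration of the general idea: since fastText sees each input line as an unordered bag of n-grams, sequence information has to be injected into the tokens themselves, for example by prepending a coarse position marker and tokens from neighboring sentences. The helper below and its bucketing scheme are hypothetical, not the authors' method.

    # Hypothetical context-injection helper; the paper's actual pre-processing
    # differs in its details. Each output line can be labeled and fed to
    # train_supervised exactly as in the previous sketch.
    def with_context(sentences, n_buckets=10):
        """Yield fastText-ready token strings for the sentences of one abstract."""
        n = len(sentences)
        for i, sent in enumerate(sentences):
            # Coarse relative-position token, e.g. __pos_0__ ... __pos_9__
            pos = f"__pos_{min(i * n_buckets // max(n, 1), n_buckets - 1)}__"
            prev_toks = sentences[i - 1] if i > 0 else ""
            next_toks = sentences[i + 1] if i < n - 1 else ""
            yield " ".join(p for p in (pos, prev_toks, sent, next_toks) if p)

    abstract = [
        "diabetes is a growing public health problem .",
        "we randomized 200 patients to metformin or placebo .",
        "hba1c decreased significantly in the treatment arm .",
    ]
    for line in with_context(abstract):
        print(line)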
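The low-resource gains reported in the RESULTS section come from initializing the classifier with N-gram vectors pre-trained on unlabeled in-domain text. Below is a minimal sketch of that workflow, again with the fasttext Python bindings; the corpus and file names are placeholders. One practical detail: train_supervised accepts pre-trained vectors via its pretrainedVectors argument, which expects a textual .vec file, so the vectors are exported in that format first, and the dim values must match.

    import fasttext

    # 1) Pre-train subword-aware vectors on an unlabeled biomedical corpus
    #    ("pubmed_corpus.txt" is a placeholder: one pre-processed sentence per line).
    vec_model = fasttext.train_unsupervised(
        input="pubmed_corpus.txt",
        model="skipgram",
        dim=100,  # must match the dim passed to train_supervised below
    )

    # 2) Export the vectors in the textual .vec format expected by
    #    train_supervised's pretrainedVectors argument.
    words = vec_model.get_words()
    with open("pubmed_vectors.vec", "w", encoding="utf-8") as f:
        f.write(f"{len(words)} {vec_model.get_dimension()}\n")
        for w in words:
            row = " ".join(f"{x:.5f}" for x in vec_model.get_word_vector(w))
            f.write(f"{w} {row}\n")

    # 3) Train the classifier on a small labeled set (e.g. ~1000 sentences),
    #    starting from the pre-trained vectors instead of a random initialization.
    clf = fasttext.train_supervised(
        input="small_train.txt",
        dim=100,
        pretrainedVectors="pubmed_vectors.vec",
    )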

Bibliographic Details
Main Authors: Agibetov, Asan; Blagec, Kathrin; Xu, Hong; Samwald, Matthias
Format: Online Article (Text)
Language: English
Published: BioMed Central, 2018-12-22
Journal: BMC Bioinformatics
Subjects: Research Article
Collection: PubMed (National Center for Biotechnology Information), record format MEDLINE/PubMed
Online Access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6303852/
https://www.ncbi.nlm.nih.gov/pubmed/30577747
http://dx.doi.org/10.1186/s12859-018-2496-4
License: © The Author(s) 2018. Open Access: this article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.