Improved Part-of-Speech Prediction in Suffix Analysis

MOTIVATION: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models...

Bibliographic Details
Main authors:	Fruzangohar, Mario, Kroeger, Trent A., Adelson, David L.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2013
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3790802/ https://www.ncbi.nlm.nih.gov/pubmed/24124532 http://dx.doi.org/10.1371/journal.pone.0076042

_version_	1782286651680096256
author	Fruzangohar, Mario Kroeger, Trent A. Adelson, David L.
author_facet	Fruzangohar, Mario Kroeger, Trent A. Adelson, David L.
author_sort	Fruzangohar, Mario
collection	PubMed
description	MOTIVATION: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models (HMM). Different approaches have been taken to deal with the POS tagger challenge, but, with one exception (the TnT POS tagger), previous publications on POS tagging have omitted details of the suffix analysis used for handling unknown words. The suffix of an English word is a strong predictor of a POS tag for that word. As a prerequisite for an accurate HMM POS tagger for biomedical publications, we present an efficient suffix prediction method for integration into a POS tagger. RESULTS: We have implemented a fully functional HMM POS tagger using experimentally optimised suffix-based prediction. Our simple suffix analysis method significantly outperformed the probability-interpolation-based TnT method. We have also shown how important suffix analysis can be for probability estimation of a known word (in the training corpus) with an unseen POS tag, a common scenario with a small training corpus. We then integrated this simple method in our POS tagger and determined an optimised parameter set for both methods, which can help developers to optimise their current algorithm based on our results. We also introduce the concept of counting methods in maximum likelihood estimation for the first time and show how counting methods can affect the prediction result. Finally, we describe how machine-learning techniques were applied to identify words for which the prediction of POS tags was always incorrect, and propose a method to handle words of this type. AVAILABILITY AND IMPLEMENTATION: Java source code, binaries and setup instructions are freely available at http://genomes.sapac.edu.au/text_mining/pos_tagger.zip.
format	Online Article Text
id	pubmed-3790802
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3790802 2013-10-11 Improved Part-of-Speech Prediction in Suffix Analysis Fruzangohar, Mario Kroeger, Trent A. Adelson, David L. PLoS One Research Article Public Library of Science 2013-10-04 /pmc/articles/PMC3790802/ /pubmed/24124532 http://dx.doi.org/10.1371/journal.pone.0076042 Text en © 2013 Fruzangohar et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Fruzangohar, Mario Kroeger, Trent A. Adelson, David L. Improved Part-of-Speech Prediction in Suffix Analysis
title	Improved Part-of-Speech Prediction in Suffix Analysis
title_full	Improved Part-of-Speech Prediction in Suffix Analysis
title_fullStr	Improved Part-of-Speech Prediction in Suffix Analysis
title_full_unstemmed	Improved Part-of-Speech Prediction in Suffix Analysis
title_short	Improved Part-of-Speech Prediction in Suffix Analysis
title_sort	improved part-of-speech prediction in suffix analysis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3790802/ https://www.ncbi.nlm.nih.gov/pubmed/24124532 http://dx.doi.org/10.1371/journal.pone.0076042
work_keys_str_mv	AT fruzangoharmario improvedpartofspeechpredictioninsuffixanalysis AT kroegertrenta improvedpartofspeechpredictioninsuffixanalysis AT adelsondavidl improvedpartofspeechpredictioninsuffixanalysis
