Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm

BACKGROUND: Tokenization is an important component of language processing yet there is no widely accepted tokenization method for English texts, including biomedical texts. Other than rule based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical...

Bibliographic Details
Main authors:	Barrett, Neil; Weber-Jahnke, Jens
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111587/ https://www.ncbi.nlm.nih.gov/pubmed/21658288 http://dx.doi.org/10.1186/1471-2105-12-S3-S1

collection	PubMed
description	BACKGROUND: Tokenization is an important component of language processing yet there is no widely accepted tokenization method for English texts, including biomedical texts. Other than rule based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer’s output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text. RESULTS: Medpost and our adapted Viterbi tokenizer performed best with a 92.9% and 92.4% accuracy respectively. CONCLUSIONS: Our evaluation of our design pattern and guidelines supports our claim that the design pattern and guidelines are a viable approach to tokenizer construction (producing tokenizers matching leading custom-built tokenizers in a particular domain). Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging. In doing so, POS tag sequences and training data have a significant impact on proper text tokenization.
format	Online Article Text
id	pubmed-3111587
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3111587 2011-06-11 Barrett, Neil Weber-Jahnke, Jens BMC Bioinformatics Research BioMed Central 2011-06-09 /pmc/articles/PMC3111587/ /pubmed/21658288 http://dx.doi.org/10.1186/1471-2105-12-S3-S1 Text en Copyright ©2011 Barrett and Weber-Jahnke.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.