
Mapping biological entities using the longest approximately common prefix method

BACKGROUND: The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The tas...

Bibliographic Details
Main authors:	Rudniy, Alex; Song, Min; Geller, James
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4086698/ https://www.ncbi.nlm.nih.gov/pubmed/24928653 http://dx.doi.org/10.1186/1471-2105-15-187

author	Rudniy, Alex; Song, Min; Geller, James
collection	PubMed
description	BACKGROUND: The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation. RESULTS: This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for approximate string matching that runs in linear time. We compare the LACP method for performance, precision and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we present a spell checker based on the LACP method. CONCLUSIONS: The Longest Approximately Common Prefix method completes its string similarity evaluations in less time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix outperforms these nine approximate string matching methods in its Maximum F1 measure when evaluated on three out of the four datasets, and in its average precision on two of the four datasets.
format	Online Article Text
id	pubmed-4086698
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4086698 2014-07-24 Mapping biological entities using the longest approximately common prefix method Rudniy, Alex; Song, Min; Geller, James BMC Bioinformatics Methodology Article BioMed Central 2014-06-14 /pmc/articles/PMC4086698/ /pubmed/24928653 http://dx.doi.org/10.1186/1471-2105-15-187 Text en Copyright © 2014 Rudniy et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Mapping biological entities using the longest approximately common prefix method
topic	Methodology Article
