
Identifying named entities from PubMed® for enriching semantic categories

BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature,...


Bibliographic Details
Main Authors: Kim, Sun; Lu, Zhiyong; Wilbur, W John
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349776/
https://www.ncbi.nlm.nih.gov/pubmed/25887671
http://dx.doi.org/10.1186/s12859-015-0487-2
author Kim, Sun
Lu, Zhiyong
Wilbur, W John
collection PubMed
description BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®. RESULTS: We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords “gene”, “protein”, “disease”, “cell” and “cells”. CONCLUSIONS: Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation; however, a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4349776
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4349776 2015-03-06 Identifying named entities from PubMed® for enriching semantic categories Kim, Sun Lu, Zhiyong Wilbur, W John BMC Bioinformatics Research Article BioMed Central 2015-02-21 /pmc/articles/PMC4349776/ /pubmed/25887671 http://dx.doi.org/10.1186/s12859-015-0487-2 Text en © Kim et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Identifying named entities from PubMed® for enriching semantic categories
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349776/
https://www.ncbi.nlm.nih.gov/pubmed/25887671
http://dx.doi.org/10.1186/s12859-015-0487-2