Identifying named entities from PubMed® for enriching semantic categories
BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature,...
Main Authors: | Kim, Sun; Lu, Zhiyong; Wilbur, W John |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349776/ https://www.ncbi.nlm.nih.gov/pubmed/25887671 http://dx.doi.org/10.1186/s12859-015-0487-2 |
_version_ | 1782360083146997760 |
---|---|
author | Kim, Sun Lu, Zhiyong Wilbur, W John |
author_facet | Kim, Sun Lu, Zhiyong Wilbur, W John |
author_sort | Kim, Sun |
collection | PubMed |
description | BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®. RESULTS: We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, “gene”, “protein”, “disease”, “cell” and “cells”. CONCLUSIONS: Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4349776 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-43497762015-03-06 Identifying named entities from PubMed® for enriching semantic categories Kim, Sun Lu, Zhiyong Wilbur, W John BMC Bioinformatics Research Article BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®. RESULTS: We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, “gene”, “protein”, “disease”, “cell” and “cells”. CONCLUSIONS: Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users. BioMed Central 2015-02-21 /pmc/articles/PMC4349776/ /pubmed/25887671 http://dx.doi.org/10.1186/s12859-015-0487-2 Text en © Kim et al.; licensee BioMed Central. 
2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Kim, Sun Lu, Zhiyong Wilbur, W John Identifying named entities from PubMed® for enriching semantic categories |
title | Identifying named entities from PubMed® for enriching semantic categories |
title_full | Identifying named entities from PubMed® for enriching semantic categories |
title_fullStr | Identifying named entities from PubMed® for enriching semantic categories |
title_full_unstemmed | Identifying named entities from PubMed® for enriching semantic categories |
title_short | Identifying named entities from PubMed® for enriching semantic categories |
title_sort | identifying named entities from pubmed® for enriching semantic categories |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349776/ https://www.ncbi.nlm.nih.gov/pubmed/25887671 http://dx.doi.org/10.1186/s12859-015-0487-2 |
work_keys_str_mv | AT kimsun identifyingnamedentitiesfrompubmedforenrichingsemanticcategories AT luzhiyong identifyingnamedentitiesfrompubmedforenrichingsemanticcategories AT wilburwjohn identifyingnamedentitiesfrompubmedforenrichingsemanticcategories |