A modular framework for biomedical concept recognition

BACKGROUND: Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for...

Bibliographic Details
Main authors: Campos, David, Matos, Sérgio, Oliveira, José Luís
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3849280/
https://www.ncbi.nlm.nih.gov/pubmed/24063607
http://dx.doi.org/10.1186/1471-2105-14-281
_version_ 1782293902157414400
author Campos, David
Matos, Sérgio
Oliveira, José Luís
author_facet Campos, David
Matos, Sérgio
Oliveira, José Luís
author_sort Campos, David
collection PubMed
description BACKGROUND: Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools. RESULTS: This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. 
Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification. CONCLUSIONS: Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.
format Online
Article
Text
id pubmed-3849280
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-38492802013-12-06 A modular framework for biomedical concept recognition Campos, David Matos, Sérgio Oliveira, José Luís BMC Bioinformatics Software BACKGROUND: Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools. RESULTS: This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. 
Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification. CONCLUSIONS: Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji. BioMed Central 2013-09-24 /pmc/articles/PMC3849280/ /pubmed/24063607 http://dx.doi.org/10.1186/1471-2105-14-281 Text en Copyright © 2013 Campos et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Software
Campos, David
Matos, Sérgio
Oliveira, José Luís
A modular framework for biomedical concept recognition
title A modular framework for biomedical concept recognition
title_full A modular framework for biomedical concept recognition
title_fullStr A modular framework for biomedical concept recognition
title_full_unstemmed A modular framework for biomedical concept recognition
title_short A modular framework for biomedical concept recognition
title_sort modular framework for biomedical concept recognition
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3849280/
https://www.ncbi.nlm.nih.gov/pubmed/24063607
http://dx.doi.org/10.1186/1471-2105-14-281
work_keys_str_mv AT camposdavid amodularframeworkforbiomedicalconceptrecognition
AT matossergio amodularframeworkforbiomedicalconceptrecognition
AT oliveirajoseluis amodularframeworkforbiomedicalconceptrecognition
AT camposdavid modularframeworkforbiomedicalconceptrecognition
AT matossergio modularframeworkforbiomedicalconceptrecognition
AT oliveirajoseluis modularframeworkforbiomedicalconceptrecognition