Assessment of disease named entity recognition on a corpus of annotated sentences

BACKGROUND: In recent years, the recognition of semantic types from the biomedical scientific literature has been focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types like diseases have not received the same level of attention. Differe...

Bibliographic Details
Main Authors:	Jimeno, Antonio; Jimenez-Ruiz, Ernesto; Lee, Vivian; Gaudan, Sylvain; Berlanga, Rafael; Rebholz-Schuhmann, Dietrich
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352871/ https://www.ncbi.nlm.nih.gov/pubmed/18426548 http://dx.doi.org/10.1186/1471-2105-9-S3-S3

author	Jimeno, Antonio; Jimenez-Ruiz, Ernesto; Lee, Vivian; Gaudan, Sylvain; Berlanga, Rafael; Rebholz-Schuhmann, Dietrich
collection	PubMed
description	BACKGROUND: In recent years, the recognition of semantic types from the biomedical scientific literature has focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types like diseases have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit), other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). Currently, MetaMap, which is provided by the National Library of Medicine (NLM), is the state-of-the-art solution for the annotation of concepts from UMLS (Unified Medical Language System) in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions.
RESULTS: As part of our research work, we have taken a corpus that has been delivered in the past for the identification of associations of genes to diseases based on the UMLS Metathesaurus, and we have reprocessed and re-annotated the corpus. We have gathered annotations for disease entities from two curators, analyzed their disagreement (0.51 in the kappa-statistic) and composed a single annotated corpus for public use. Thereafter, three solutions for disease named entity recognition, including MetaMap, have been applied to the corpus to automatically annotate it with UMLS Metathesaurus concepts. The resulting annotations have been benchmarked to compare their performance.
CONCLUSIONS: The annotated corpus is publicly available at and can serve as a benchmark to other systems. In addition, we found that dictionary look-up already provides competitive results, indicating that the use of disease terminology is highly standardized throughout the terminologies and the literature. MetaMap generates precise results at the expense of insufficient recall, while our statistical method obtains better recall at a lower precision rate. Even better results in terms of precision are achieved by combining at least two of the three methods, but this approach again lowers recall. Altogether, our analysis gives a better understanding of the complexity of disease annotations in the literature. MetaMap and the dictionary-based approach are available through the Whatizit web service infrastructure (Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A: Text processing through Web services: Calling Whatizit. Bioinformatics 2008, 24:296-298).
format	Text
id	pubmed-2352871
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2352871 2008-04-29 Assessment of disease named entity recognition on a corpus of annotated sentences Jimeno, Antonio; Jimenez-Ruiz, Ernesto; Lee, Vivian; Gaudan, Sylvain; Berlanga, Rafael; Rebholz-Schuhmann, Dietrich BMC Bioinformatics Proceedings BioMed Central 2008-04-11 /pmc/articles/PMC2352871/ /pubmed/18426548 http://dx.doi.org/10.1186/1471-2105-9-S3-S3 Text en Copyright © 2008 Jimeno et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Assessment of disease named entity recognition on a corpus of annotated sentences
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352871/ https://www.ncbi.nlm.nih.gov/pubmed/18426548 http://dx.doi.org/10.1186/1471-2105-9-S3-S3
