
Dynamic summarization of bibliographic-based data

BACKGROUND: Traditional information retrieval techniques typically return excessive output when directed at large bibliographic databases. Natural Language Processing applications strive to extract salient content from the excessive data. Semantic MEDLINE, a National Library of Medicine (NLM) natura...


Bibliographic Details
Main Authors:	Workman, T Elizabeth, Hurdle, John F
Format:	Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042900/ https://www.ncbi.nlm.nih.gov/pubmed/21284871 http://dx.doi.org/10.1186/1472-6947-11-6

_version_	1782198569741058048
author	Workman, T Elizabeth Hurdle, John F
author_facet	Workman, T Elizabeth Hurdle, John F
author_sort	Workman, T Elizabeth
collection	PubMed
description	BACKGROUND: Traditional information retrieval techniques typically return excessive output when directed at large bibliographic databases. Natural Language Processing applications strive to extract salient content from the excessive data. Semantic MEDLINE, a National Library of Medicine (NLM) natural language processing application, highlights relevant information in PubMed data. However, Semantic MEDLINE implements manually coded schemas, accommodating few information needs. Currently, there are only five such schemas, while many more would be needed to realistically accommodate all potential users. The aim of this project was to develop and evaluate a statistical algorithm that automatically identifies relevant bibliographic data; the new algorithm could be incorporated into a dynamic schema to accommodate various information needs in Semantic MEDLINE, and eliminate the need for multiple schemas. METHODS: We developed a flexible algorithm named Combo that combines three statistical metrics, the Kullback-Leibler Divergence (KLD), Riloff's RlogF metric (RlogF), and a new metric called PredScal, to automatically identify salient data in bibliographic text. We downloaded citations from a PubMed search query addressing the genetic etiology of bladder cancer. The citations were processed with SemRep, an NLM rule-based application that produces semantic predications. SemRep output was processed by Combo, in addition to the standard Semantic MEDLINE genetics schema and independently by the two individual KLD and RlogF metrics. We evaluated each summarization method using an existing reference standard within the task-based context of genetic database curation. RESULTS: Combo asserted 74 genetic entities implicated in bladder cancer development, whereas the traditional schema asserted 10 genetic entities; the KLD and RlogF metrics individually asserted 77 and 69 genetic entities, respectively. Combo achieved 61% recall and 81% precision, with an F-score of 0.69. The traditional schema achieved 23% recall and 100% precision, with an F-score of 0.37. The KLD metric achieved 61% recall, 70% precision, with an F-score of 0.65. The RlogF metric achieved 61% recall, 72% precision, with an F-score of 0.66. CONCLUSIONS: Semantic MEDLINE summarization using the new Combo algorithm outperformed a conventional summarization schema in a genetic database curation task. It potentially could streamline information acquisition for other needs without having to hand-build multiple saliency schemas.
format	Text
id	pubmed-3042900
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3042900 2011-02-25 Dynamic summarization of bibliographic-based data Workman, T Elizabeth Hurdle, John F BMC Med Inform Decis Mak Research Article BioMed Central 2011-02-01 /pmc/articles/PMC3042900/ /pubmed/21284871 http://dx.doi.org/10.1186/1472-6947-11-6 Text en Copyright ©2011 Workman and Hurdle; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Workman, T Elizabeth Hurdle, John F Dynamic summarization of bibliographic-based data
title	Dynamic summarization of bibliographic-based data
title_full	Dynamic summarization of bibliographic-based data
title_fullStr	Dynamic summarization of bibliographic-based data
title_full_unstemmed	Dynamic summarization of bibliographic-based data
title_short	Dynamic summarization of bibliographic-based data
title_sort	dynamic summarization of bibliographic-based data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042900/ https://www.ncbi.nlm.nih.gov/pubmed/21284871 http://dx.doi.org/10.1186/1472-6947-11-6
work_keys_str_mv	AT workmantelizabeth dynamicsummarizationofbibliographicbaseddata AT hurdlejohnf dynamicsummarizationofbibliographicbaseddata
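
The recall, precision, and F-score figures quoted in the abstract can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes the reported F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not say which variant was used. The function name `f_score` and the `reported` table are illustrative, and the inputs are the rounded percentages from the abstract, so small differences from the published values are to be expected.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, F-score as reported), taken from the abstract.
# The percentages are rounded, so computed values may differ slightly.
reported = {
    "Combo": (0.81, 0.61, 0.69),
    "Traditional schema": (1.00, 0.23, 0.37),
    "KLD": (0.70, 0.61, 0.65),
    "RlogF": (0.72, 0.61, 0.66),
}

for name, (precision, recall, f_reported) in reported.items():
    computed = f_score(precision, recall)
    print(f"{name}: computed F = {computed:.3f}, reported F = {f_reported:.2f}")
```

Under this assumption the computed values agree with the reported ones to within about 0.01. The Combo figure comes out slightly above 0.69, presumably because the published percentages are themselves rounded.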
