
Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCre...


Bibliographic Details
Main Authors: Shatkay, Hagit, Pan, Fengxia, Rzhetsky, Andrey, Wilbur, W. John
Format: Text
Language: English
Published: Oxford University Press 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2530883/
https://www.ncbi.nlm.nih.gov/pubmed/18718948
http://dx.doi.org/10.1093/bioinformatics/btn381
author Shatkay, Hagit
Pan, Fengxia
Rzhetsky, Andrey
Wilbur, W. John
author_sort Shatkay, Hagit
collection PubMed
description Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca
format Text
id pubmed-2530883
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-25308832009-02-25 Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users Shatkay, Hagit Pan, Fengxia Rzhetsky, Andrey Wilbur, W. John Bioinformatics Original Papers Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. 
The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice. Contact: shatkay@cs.queensu.ca Oxford University Press 2008-09-15 2008-08-20 /pmc/articles/PMC2530883/ /pubmed/18718948 http://dx.doi.org/10.1093/bioinformatics/btn381 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2530883/
https://www.ncbi.nlm.nih.gov/pubmed/18718948
http://dx.doi.org/10.1093/bioinformatics/btn381