
Classifying domain-specific text documents containing ambiguous keywords

A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process wi...


Bibliographic Details
Main authors:	Karimi, Kamran, Agalakov, Sergei, Telmer, Cheryl A, Beatman, Thomas R, Pells, Troy J, Arshinoff, Bradley I M, Ku, Carolyn J, Foley, Saoirse, Hinman, Veronica F, Ettensohn, Charles A, Vize, Peter D
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Database Tool
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8588847/ https://www.ncbi.nlm.nih.gov/pubmed/34585729 http://dx.doi.org/10.1093/database/baab062

_version_	1784598575898951680
author	Karimi, Kamran Agalakov, Sergei Telmer, Cheryl A Beatman, Thomas R Pells, Troy J Arshinoff, Bradley I M Ku, Carolyn J Foley, Saoirse Hinman, Veronica F Ettensohn, Charles A Vize, Peter D
author_facet	Karimi, Kamran Agalakov, Sergei Telmer, Cheryl A Beatman, Thomas R Pells, Troy J Arshinoff, Bradley I M Ku, Carolyn J Foley, Saoirse Hinman, Veronica F Ettensohn, Charles A Vize, Peter D
author_sort	Karimi, Kamran
collection	PubMed
description	A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process will save time, but it has to be done well enough to ensure few relevant papers are rejected and few irrelevant papers are accepted. A good solution would be fast, work with the limited amount of data freely available (full paper body may be missing), handle ambiguous keywords and be as domain-neutral as possible. In this paper, we evaluate a number of classification algorithms for identifying a domain-specific set of papers about echinoderm species and show that the resulting tool satisfies most of the abovementioned requirements. Echinoderms consist of a number of very different organisms, including brittle stars, sea stars (starfish), sea urchins and sea cucumbers. While their taxonomic identifiers are specific, the common names are used in many other contexts, creating ambiguity and making a keyword search prone to error. We try classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM, Bagging, AdaBoost and Neural Network learning models and compare their performance. We show how effective the resulting classifiers are in filtering irrelevant articles returned from PubMed. The methodology used is more dependent on the good selection of training data and is a practical solution that can be applied to other fields of study facing similar challenges. Database URL: The code and data reported in this paper are freely available at http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/
format	Online Article Text
id	pubmed-8588847
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-85888472021-11-15 Classifying domain-specific text documents containing ambiguous keywords Karimi, Kamran Agalakov, Sergei Telmer, Cheryl A Beatman, Thomas R Pells, Troy J Arshinoff, Bradley I M Ku, Carolyn J Foley, Saoirse Hinman, Veronica F Ettensohn, Charles A Vize, Peter D Database (Oxford) Database Tool A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process will save time, but it has to be done well enough to ensure few relevant papers are rejected and few irrelevant papers are accepted. A good solution would be fast, work with the limited amount of data freely available (full paper body may be missing), handle ambiguous keywords and be as domain-neutral as possible. In this paper, we evaluate a number of classification algorithms for identifying a domain-specific set of papers about echinoderm species and show that the resulting tool satisfies most of the abovementioned requirements. Echinoderms consist of a number of very different organisms, including brittle stars, sea stars (starfish), sea urchins and sea cucumbers. While their taxonomic identifiers are specific, the common names are used in many other contexts, creating ambiguity and making a keyword search prone to error. We try classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM, Bagging, AdaBoost and Neural Network learning models and compare their performance. We show how effective the resulting classifiers are in filtering irrelevant articles returned from PubMed. The methodology used is more dependent on the good selection of training data and is a practical solution that can be applied to other fields of study facing similar challenges. 
Database URL: The code and data reported in this paper are freely available at http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/ Oxford University Press 2021-11-26 /pmc/articles/PMC8588847/ /pubmed/34585729 http://dx.doi.org/10.1093/database/baab062 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Database Tool Karimi, Kamran Agalakov, Sergei Telmer, Cheryl A Beatman, Thomas R Pells, Troy J Arshinoff, Bradley I M Ku, Carolyn J Foley, Saoirse Hinman, Veronica F Ettensohn, Charles A Vize, Peter D Classifying domain-specific text documents containing ambiguous keywords
title	Classifying domain-specific text documents containing ambiguous keywords
title_full	Classifying domain-specific text documents containing ambiguous keywords
title_fullStr	Classifying domain-specific text documents containing ambiguous keywords
title_full_unstemmed	Classifying domain-specific text documents containing ambiguous keywords
title_short	Classifying domain-specific text documents containing ambiguous keywords
title_sort	classifying domain-specific text documents containing ambiguous keywords
topic	Database Tool
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8588847/ https://www.ncbi.nlm.nih.gov/pubmed/34585729 http://dx.doi.org/10.1093/database/baab062
work_keys_str_mv	AT karimikamran classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT agalakovsergei classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT telmercheryla classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT beatmanthomasr classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT pellstroyj classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT arshinoffbradleyim classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT kucarolynj classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT foleysaoirse classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT hinmanveronicaf classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT ettensohncharlesa classifyingdomainspecifictextdocumentscontainingambiguouskeywords AT vizepeterd classifyingdomainspecifictextdocumentscontainingambiguouskeywords

