
A Method to Generate Soft Reference Data for Topic Identification

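Text mining and topic identification models are becoming increasingly relevant to extract value from the huge amount of unstructured textual information that companies obtain from their users and clients nowadays. Soft approaches to these problems are also gaining relevance, as in some contexts it may be unrealistic to assume that any document has to be associated to a single topic without any further consideration of the involved uncertainties. However, there is an almost total lack of reference documents allowing a proper assessment of the performance of soft classifiers in such soft topic identification tasks. To address this lack, in this paper a method is proposed that generates topic identification reference documents with a soft but objective nature, and which proceeds by combining, in random but known proportions, phrases of existing documents dealing with different topics. We also provide a computational study illustrating the application of the proposed method on a well-known benchmark for topic identification, as well as showing the possibility of carrying out an informative evaluation of soft classifiers in the context of soft topic identification.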

Bibliographic Details
Main Authors: Vélez, Daniel; Villarino, Guillermo; Rodríguez, J. Tinguaro; Gómez, Daniel
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274723/
http://dx.doi.org/10.1007/978-3-030-50153-2_5
author Vélez, Daniel
Villarino, Guillermo
Rodríguez, J. Tinguaro
Gómez, Daniel
collection PubMed
description Text mining and topic identification models are becoming increasingly relevant to extract value from the huge amount of unstructured textual information that companies obtain from their users and clients nowadays. Soft approaches to these problems are also gaining relevance, as in some contexts it may be unrealistic to assume that any document has to be associated to a single topic without any further consideration of the involved uncertainties. However, there is an almost total lack of reference documents allowing a proper assessment of the performance of soft classifiers in such soft topic identification tasks. To address this lack, in this paper a method is proposed that generates topic identification reference documents with a soft but objective nature, and which proceeds by combining, in random but known proportions, phrases of existing documents dealing with different topics. We also provide a computational study illustrating the application of the proposed method on a well-known benchmark for topic identification, as well as showing the possibility of carrying out an informative evaluation of soft classifiers in the context of soft topic identification.
format Online
Article
Text
id pubmed-7274723
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7274723 2020-06-08
Information Processing and Management of Uncertainty in Knowledge-Based Systems
Article
2020-05-16
Text en
© Springer Nature Switzerland AG 2020
This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title A Method to Generate Soft Reference Data for Topic Identification
topic Article
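
The description in this record says the method builds soft reference documents by combining, in random but known proportions, phrases of existing documents on different topics. A minimal Python sketch of that idea follows. It is not the authors' implementation: the function names `split_phrases` and `make_soft_document`, the period-based phrase splitting, the cut-point scheme for drawing the proportions, and the toy corpus are all assumptions made for illustration, since the record does not specify those details.

```python
import random


def split_phrases(text):
    """Naive phrase splitter (assumption): break the text on periods."""
    return [p.strip() for p in text.split(".") if p.strip()]


def make_soft_document(docs_by_topic, n_phrases, rng):
    """Mix phrases from topic-labelled documents in random but known proportions.

    Returns the generated text and a dict mapping each topic to the
    fraction of phrases drawn from it (the soft reference label).
    """
    topics = sorted(docs_by_topic)
    # Random cut points partition n_phrases into one count per topic.
    cuts = sorted(rng.randint(0, n_phrases) for _ in range(len(topics) - 1))
    bounds = [0] + cuts + [n_phrases]
    counts = [hi - lo for lo, hi in zip(bounds, bounds[1:])]

    phrases = []
    for topic, count in zip(topics, counts):
        pool = [p for doc in docs_by_topic[topic] for p in split_phrases(doc)]
        # Sample with replacement so that small pools still work.
        phrases.extend(rng.choice(pool) for _ in range(count))

    rng.shuffle(phrases)
    proportions = {t: c / n_phrases for t, c in zip(topics, counts)}
    return ". ".join(phrases) + ".", proportions


# Hypothetical toy corpus: two topics, one short document each.
corpus = {
    "sports": ["The team won the match. The striker scored twice."],
    "finance": ["Markets fell sharply. The bank raised rates."],
}
text, proportions = make_soft_document(corpus, n_phrases=4, rng=random.Random(0))
print(text)
print(proportions)
```

Because the phrase counts per topic are drawn before the text is assembled, the returned proportions are known exactly, so a soft classifier's output for the generated text can be compared against them directly.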