Creating Classification Models from Textual Descriptions of Companies Using Crunchbase

This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the propos...

Bibliographic Details
Main Authors:	Felgueiras, Marco; Batista, Fernando; Carvalho, Joao Paulo
Format:	Online Article Text
Language:	English
Published:	2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51

author	Felgueiras, Marco; Batista, Fernando; Carvalho, Joao Paulo
collection	PubMed
description	This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories from a set of 46 possible categories, and the proposed models predict the categories based solely on the company's textual description. Several natural language processing strategies were tested for feature extraction, including stemming, lemmatization, and part-of-speech tagging. The dataset is highly unbalanced: the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that make it possible to predict its area of activity, expressed by its corresponding categories, with about 70% precision and 42% recall. In a second set of experiments, a multiclass problem that attempts to find the single most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for the automatic classification of texts, including not only company descriptions but also other texts such as web pages, text blogs, and news pages.
format	Online Article Text
id	pubmed-7274320
institution	National Center for Biotechnology Information
language	English
publishDate	2020
record_format	MEDLINE/PubMed
spelling	pubmed-7274320 2020-06-05 Creating Classification Models from Textual Descriptions of Companies Using Crunchbase Felgueiras, Marco; Batista, Fernando; Carvalho, Joao Paulo Information Processing and Management of Uncertainty in Knowledge-Based Systems Article 2020-05-18 /pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Creating Classification Models from Textual Descriptions of Companies Using Crunchbase
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51
