Creating Classification Models from Textual Descriptions of Companies Using Crunchbase

This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the propos...

Bibliographic Details
Main Authors: Felgueiras, Marco; Batista, Fernando; Carvalho, Joao Paulo
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274320/
http://dx.doi.org/10.1007/978-3-030-50146-4_51
author Felgueiras, Marco
Batista, Fernando
Carvalho, Joao Paulo
collection PubMed
description This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the categories based solely on the company's textual description. A number of natural language processing strategies were tested for feature extraction, including stemming, lemmatization, and part-of-speech tags. This is a highly unbalanced dataset, where the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that allow its area of activity, expressed by its corresponding categories, to be predicted with about 70% precision and 42% recall. In a second set of experiments, a multiclass problem that attempts to find the most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for the automatic classification of texts, including not only company descriptions but also other texts, such as web pages, text blogs, and news pages.
format Online
Article
Text
id pubmed-7274320
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7274320 2020-06-05 Creating Classification Models from Textual Descriptions of Companies Using Crunchbase. Felgueiras, Marco; Batista, Fernando; Carvalho, Joao Paulo. Information Processing and Management of Uncertainty in Knowledge-Based Systems. Article. 2020-05-18. /pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51 Text en © Springer Nature Switzerland AG 2020. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Creating Classification Models from Textual Descriptions of Companies Using Crunchbase
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274320/
http://dx.doi.org/10.1007/978-3-030-50146-4_51
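
The description above reports the multilabel results as precision and recall. As a minimal sketch of how such figures can be computed, the Python function below micro-averages both metrics over sets of true and predicted categories. The averaging scheme, the function name `micro_precision_recall`, and the toy category names are assumptions made for illustration; this record does not state which averaging the authors used.

```python
from typing import List, Set, Tuple


def micro_precision_recall(
    true_labels: List[Set[str]], predicted_labels: List[Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all (item, label) pairs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have one label set per item")
    # A true positive is a label that appears in both the true and predicted set.
    true_positives = sum(len(t & p) for t, p in zip(true_labels, predicted_labels))
    predicted_total = sum(len(p) for p in predicted_labels)
    actual_total = sum(len(t) for t in true_labels)
    # Guard against division by zero when nothing is predicted or labeled.
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical toy data: three companies with invented category names.
truth = [{"software", "finance"}, {"health"}, {"software", "education", "media"}]
preds = [{"software"}, {"health", "finance"}, {"software"}]

precision, recall = micro_precision_recall(truth, preds)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.50
```

In this toy case the predictions are mostly correct but miss several true categories, so precision exceeds recall, the same direction as the roughly 70% precision and 42% recall quoted in the description.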