Creating Classification Models from Textual Descriptions of Companies Using Crunchbase
This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the propos...
Main Authors: | Felgueiras, Marco; Batista, Fernando; Carvalho, Joao Paulo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51 |
_version_ | 1783542555591311360 |
---|---|
author | Felgueiras, Marco Batista, Fernando Carvalho, Joao Paulo |
author_facet | Felgueiras, Marco Batista, Fernando Carvalho, Joao Paulo |
author_sort | Felgueiras, Marco |
collection | PubMed |
description | This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the categories based solely on the company textual description. A number of natural language processing strategies have been tested for feature extraction, including stemming, lemmatization, and part-of-speech tags. This is a highly unbalanced dataset, where the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that allow its area of activity, expressed by its corresponding categories, to be predicted with about 70% precision and 42% recall. In a second set of experiments, a multiclass problem that attempts to find the most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for automatic classification of texts, not only consisting of company descriptions, but also other texts, such as web pages, text blogs, news pages, etc. |
format | Online Article Text |
id | pubmed-7274320 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-7274320 2020-06-05 Creating Classification Models from Textual Descriptions of Companies Using Crunchbase Felgueiras, Marco Batista, Fernando Carvalho, Joao Paulo Information Processing and Management of Uncertainty in Knowledge-Based Systems Article This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the categories based solely on the company textual description. A number of natural language processing strategies have been tested for feature extraction, including stemming, lemmatization, and part-of-speech tags. This is a highly unbalanced dataset, where the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that allow its area of activity, expressed by its corresponding categories, to be predicted with about 70% precision and 42% recall. In a second set of experiments, a multiclass problem that attempts to find the most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for automatic classification of texts, not only consisting of company descriptions, but also other texts, such as web pages, text blogs, news pages, etc. 2020-05-18 /pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Felgueiras, Marco Batista, Fernando Carvalho, Joao Paulo Creating Classification Models from Textual Descriptions of Companies Using Crunchbase |
title | Creating Classification Models from Textual Descriptions of Companies Using Crunchbase |
title_full | Creating Classification Models from Textual Descriptions of Companies Using Crunchbase |
title_fullStr | Creating Classification Models from Textual Descriptions of Companies Using Crunchbase |
title_full_unstemmed | Creating Classification Models from Textual Descriptions of Companies Using Crunchbase |
title_short | Creating Classification Models from Textual Descriptions of Companies Using Crunchbase |
title_sort | creating classification models from textual descriptions of companies using crunchbase |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274320/ http://dx.doi.org/10.1007/978-3-030-50146-4_51 |
work_keys_str_mv | AT felgueirasmarco creatingclassificationmodelsfromtextualdescriptionsofcompaniesusingcrunchbase AT batistafernando creatingclassificationmodelsfromtextualdescriptionsofcompaniesusingcrunchbase AT carvalhojoaopaulo creatingclassificationmodelsfromtextualdescriptionsofcompaniesusingcrunchbase |