
Connecting firm's web scraped textual content to body of science: Utilizing microsoft academic graph hierarchical topic modeling

This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with nov...


Bibliographic Details
Main Authors: Hajikhani, Arash; Pukelis, Lukas; Suominen, Arho; Ashouri, Sajad; Schubert, Torben; Notten, Ad; Cunningham, Scott W.
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914545/
https://www.ncbi.nlm.nih.gov/pubmed/35284247
http://dx.doi.org/10.1016/j.mex.2022.101650
author Hajikhani, Arash
Pukelis, Lukas
Suominen, Arho
Ashouri, Sajad
Schubert, Torben
Notten, Ad
Cunningham, Scott W.
author_sort Hajikhani, Arash
collection PubMed
description This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure. The method contains three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources.
format Online
Article
Text
id pubmed-8914545
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8914545 2022-03-12 Connecting firm's web scraped textual content to body of science: Utilizing microsoft academic graph hierarchical topic modeling Hajikhani, Arash Pukelis, Lukas Suominen, Arho Ashouri, Sajad Schubert, Torben Notten, Ad Cunningham, Scott W. MethodsX Method Article This paper demonstrates a method to transform and link textual information scraped from companies' websites to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems with novel and agile constructs that new data sources enable. Therefore, we experimented on the European classification of economic activities (known as NACE) on sectoral and company levels. We established a connection with Microsoft Academic Graph hierarchical topic modeling based on companies' website content. Central to the operationalization of our method are a web scraping process, NLP and a data transformation/linkage procedure. The method contains three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources. Elsevier 2022-02-27 /pmc/articles/PMC8914545/ /pubmed/35284247 http://dx.doi.org/10.1016/j.mex.2022.101650 Text en © 2022 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Connecting firm's web scraped textual content to body of science: Utilizing microsoft academic graph hierarchical topic modeling
topic Method Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914545/
https://www.ncbi.nlm.nih.gov/pubmed/35284247
http://dx.doi.org/10.1016/j.mex.2022.101650