
What issues are data scientists talking about? Identification of current data science issues using semantic content analysis of Q&A communities

BACKGROUND: Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists freq...


Bibliographic Details
Main author: Gurcan, Fatih
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280584/
https://www.ncbi.nlm.nih.gov/pubmed/37346688
http://dx.doi.org/10.7717/peerj-cs.1361
_version_ 1785060828306735104
author Gurcan, Fatih
collection PubMed
description BACKGROUND: Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists frequently turn to various forums, particularly domain-specific Q&A websites, to solve difficulties. These websites evolve into data science knowledge repositories over time. Analysis of such repositories can provide valuable insights into the applications, topics, trends, and challenges of data science. METHODS: In this article, we investigated what data scientists are asking by analyzing all posts to date on DSSE, a data science-focused Q&A website. To discover main topics embedded in data science discussions, we used latent Dirichlet allocation (LDA), a probabilistic approach for topic modeling. RESULTS: As a result of this analysis, 18 main topics were identified that demonstrate the current interests and issues in data science. We then examined the topics’ popularity and difficulty. In addition, we identified the most commonly used tasks, techniques, and tools in data science. As a result, “Model Training”, “Machine Learning”, and “Neural Networks” emerged as the most prominent topics. Also, “Data Manipulation”, “Coding Errors”, and “Tools” were identified as the most viewed (most popular) topics. On the other hand, the most difficult topics were identified as “Time Series”, “Computer Vision”, and “Recommendation Systems”. Our findings have significant implications for many data science stakeholders who are striving to advance data-driven architectures, concepts, tools, and techniques.
format Online
Article
Text
id pubmed-10280584
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-10280584 2023-06-21 Gurcan, Fatih PeerJ Comput Sci Data Mining and Machine Learning PeerJ Inc. 2023-05-18 /pmc/articles/PMC10280584/ /pubmed/37346688 http://dx.doi.org/10.7717/peerj-cs.1361 Text en © 2023 Gurcan. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title What issues are data scientists talking about? Identification of current data science issues using semantic content analysis of Q&A communities
topic Data Mining and Machine Learning