Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition

We present a multi-modal genre recognition framework that considers the modalities audio, text, and image by features extracted from audio signals, album cover images, and lyrics of music tracks. In contrast to pure learning of features by a neural network as done in the related work, handcrafted fe...

Bibliographic Details
Main authors: Wilkes, Ben, Vatolkin, Igor, Müller, Heinrich
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8621318/
https://www.ncbi.nlm.nih.gov/pubmed/34828199
http://dx.doi.org/10.3390/e23111502
author Wilkes, Ben
Vatolkin, Igor
Müller, Heinrich
collection PubMed
description We present a multi-modal genre recognition framework that considers the modalities audio, text, and image by features extracted from audio signals, album cover images, and lyrics of music tracks. In contrast to pure learning of features by a neural network as done in the related work, handcrafted features designed for a respective modality are also integrated, allowing for higher interpretability of created models and further theoretical analysis of the impact of individual features on genre prediction. Genre recognition is performed by binary classification of a music track with respect to each genre based on combinations of elementary features. For feature combination a two-level technique is used, which combines aggregation into fixed-length feature vectors with confidence-based fusion of classification results. Extensive experiments have been conducted for three classifier models (Naïve Bayes, Support Vector Machine, and Random Forest) and numerous feature combinations. The results are presented visually, with data reduction for improved perceptibility achieved by multi-objective analysis and restriction to non-dominated data. Feature- and classifier-related hypotheses are formulated based on the data, and their statistical significance is formally analyzed. The statistical analysis shows that the combination of two modalities almost always leads to a significant increase of performance and the combination of three modalities in several cases.
format Online
Article
Text
id pubmed-8621318
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8621318 2021-11-27 Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition Wilkes, Ben; Vatolkin, Igor; Müller, Heinrich Entropy (Basel) Article MDPI 2021-11-12 /pmc/articles/PMC8621318/ /pubmed/34828199 http://dx.doi.org/10.3390/e23111502 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8621318/
https://www.ncbi.nlm.nih.gov/pubmed/34828199
http://dx.doi.org/10.3390/e23111502
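
The description in this record mentions that results are reduced by multi-objective analysis and restriction to non-dominated data. The following is a minimal sketch of what such a non-dominated (Pareto) filter could look like; it is not taken from the article. The two objectives (classification error and number of features, both minimized), the sample values, and the function names `dominates` and `non_dominated` are assumptions chosen for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b in every objective and
    strictly better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (classification error, number of features) pairs,
# invented for this example and not reported in the article.
results: List[Point] = [
    (0.20, 50.0),
    (0.25, 10.0),
    (0.22, 60.0),  # dominated by (0.20, 50.0)
    (0.30, 5.0),
    (0.25, 12.0),  # dominated by (0.25, 10.0)
]

front = non_dominated(results)
print(front)
```

The filter compares every pair of points, which is quadratic in the number of results but adequate for a sketch; only the trade-off points that remain would need to be plotted or compared.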