T. Lidy, A. Schindler:
"CQT-based convolutional neural networks for audio scene classification and domestic audio tagging";
For the DCASE 2016 challenge on detection and classification of acoustic scenes and events, we submitted a parallel Convolutional Neural Network architecture for the tasks of classifying acoustic scenes and urban soundscapes (task 1) and domestic audio tagging (task 4). A popular choice of input to a Convolutional Neural Network in audio classification problems is the Mel-transformed spectrogram. We found, however, that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary bands and the filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture, which captures relevant audio characteristics in both time and frequency, and submitted it to DCASE 2016 tasks 1 and 4. For the acoustic scene classification task, our approach scored 80.25 % accuracy on the development set, a 10.7 % relative improvement over the DCASE baseline system, and achieved 83.3 % on the evaluation set (rank 14 of 35) in the challenge. On the domestic audio tagging task, our approach is the winning algorithm (rank 1 of 9) with a 16.6 % equal error rate.
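The "10.7 % relative improvement" quoted above can be read in more than one way, so a small worked calculation may help. The sketch below assumes the usual definition, (new - baseline) / baseline, and back-computes the baseline accuracy this would imply. The abstract does not state the baseline figure, so that value is inferred rather than quoted, and the function names are hypothetical.

```python
def relative_improvement(new_acc, baseline_acc):
    """Relative improvement of new_acc over baseline_acc, in percent.

    Assumes the common definition (new - baseline) / baseline.
    """
    return (new_acc - baseline_acc) / baseline_acc * 100.0


def implied_baseline(new_acc, rel_improvement_pct):
    """Baseline accuracy implied by a result and its relative improvement."""
    return new_acc / (1.0 + rel_improvement_pct / 100.0)


# Figures taken from the abstract: 80.25 % accuracy, 10.7 % relative gain.
ours = 80.25
baseline = implied_baseline(ours, 10.7)  # inferred, not stated in the abstract

print(f"implied baseline accuracy: {baseline:.2f} %")
print(f"absolute gain: {ours - baseline:.2f} percentage points")
print(f"relative gain: {relative_improvement(ours, baseline):.1f} %")
```

Under this reading, the implied baseline is roughly 72.5 %, so the gain in absolute terms is just under 8 percentage points rather than 10.7.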
Keywords: Neural networks, Deep learning, Classification, Audio, Audio Event Classification, Convolutional Neural Networks, CQT, Constant-Q-Transform, Mel Spectrogram
Created from the Publication Database of the Vienna University of Technology.