

Scientific Reports:

T. Lidy, A. Schindler:
"CQT-based convolutional neural networks for audio scene classification and domestic audio tagging";
2016; 6 pages.



Abstract (English):
For the DCASE 2016 challenge on detection and classification of acoustic scenes and events, we submitted a parallel Convolutional Neural Network architecture for the tasks of classifying acoustic scenes and urban soundscapes (task 1) and domestic audio tagging (task 4). Mel-transformed spectrograms are a popular choice of input to Convolutional Neural Networks in audio classification problems; we found, however, that a Constant-Q-transformed input improves results. Furthermore, we evaluated critical parameters such as the number of necessary frequency bands and the filter sizes in a Convolutional Neural Network. Finally, we propose a parallel (graph-based) neural network architecture which captures relevant audio characteristics both in time and in frequency, and submitted it to DCASE 2016 tasks 1 and 4. On the acoustic scene classification task, our approach scored 80.25 % accuracy on the development set, a 10.7 % relative improvement over the DCASE baseline system [1], and achieved 83.3 % on the evaluation set (rank 14 of 35) in the challenge. On the domestic audio tagging task, our approach is the winning algorithm (rank 1 of 9) with an equal error rate of 16.6 %.
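The abstract describes the core idea: compute a Constant-Q-transformed spectrogram instead of a Mel spectrogram and feed it into two parallel convolutional branches, one with filters extended along the frequency axis and one with filters extended along the time axis, before merging them for classification. The following minimal sketch illustrates that setup; it is not the authors' code, and the filter shapes, layer widths, number of CQT bands, and class count are illustrative assumptions (librosa and Keras are likewise assumed here, as the abstract does not name the tools used).

# Minimal sketch: CQT input to a parallel CNN with one frequency-oriented
# and one time-oriented branch. All sizes below are illustrative assumptions.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def cqt_spectrogram(path, sr=22050, n_bins=80, bins_per_octave=12, hop_length=512):
    """Load audio and compute a log-magnitude Constant-Q spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           n_bins=n_bins, bins_per_octave=bins_per_octave))
    return librosa.amplitude_to_db(C, ref=np.max)  # shape: (n_bins, n_frames)

def build_parallel_cnn(n_bins=80, n_frames=80, n_classes=15):
    """Two convolutional branches over the same CQT input, merged before the classifier."""
    inp = layers.Input(shape=(n_bins, n_frames, 1))

    # Branch 1: filters spanning many frequency bands (spectral relations).
    f = layers.Conv2D(32, (10, 23), activation='relu')(inp)
    f = layers.MaxPooling2D((2, 2))(f)
    f = layers.Flatten()(f)

    # Branch 2: filters spanning many time frames (temporal relations).
    t = layers.Conv2D(32, (21, 10), activation='relu')(inp)
    t = layers.MaxPooling2D((2, 2))(t)
    t = layers.Flatten()(t)

    merged = layers.concatenate([f, t])
    merged = layers.Dense(200, activation='relu')(merged)
    out = layers.Dense(n_classes, activation='softmax')(merged)
    return models.Model(inp, out)

model = build_parallel_cnn()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

The two branches see the same CQT input but, because of their differently shaped filters, respond to complementary structure; concatenating their flattened outputs lets the classifier use both views at once.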

Keywords:
Neural networks, Deep learning, Classification, Audio, Audio Event Classification, Convolutional Neural Networks, CQT, Constant-Q-Transform, Mel Spectrogram


Electronic version of the publication:
http://publik.tuwien.ac.at/files/publik_256009.pdf


Created from the publication database of TU Wien (Vienna University of Technology).