One of the greatest challenges in the development of binaural machine audition systems is the disambiguation between front and back audio sources, particularly in complex spatial audio scenes. The goal of this work was to develop a method for discriminating between front and back located ensembles in binaural recordings of music. The discrimination method was developed based on the traditional approach, involving hand-engineering of features, as well as using a deep learning technique incorporating the convolutional neural network (CNN). To this end, 22,496 binaural excerpts, representing either front or back located ensembles, were synthesized by convolving multi-track music recordings with 74 sets of head-related transfer functions (HRTF). According to the results obtained under HRTF-dependent test conditions, CNN showed a very high discrimination accuracy (99.4%), slightly outperforming the traditional method. However, under the HRTF-independent test scenario, CNN performed worse than the traditional algorithm, highlighting the importance of testing the algorithms under HRTF-independent conditions and indicating that the traditional method might be more generalizable than CNN. The minimum duration of audio excerpts required by both the traditional and CNN-based methods was assessed as 3 s. A minimum of 20 HRTFs is required to achieve satisfactory generalization performance for the traditional algorithm, and 30 HRTFs for CNN.

The purpose of this paper is to compare the performance of human listeners against selected machine learning algorithms in the task of classifying spatial audio scenes in binaural recordings of music under practical conditions. Three scenes were subject to classification: (1) a music ensemble (a group of musical sources) located in the front, (2) a music ensemble located at the back, and (3) a music ensemble distributed around the listener. In a listening test, undertaken remotely over the Internet, human listeners reached a classification accuracy of 42.5%. For the listeners who passed the post-screening test, the accuracy was greater, approaching 60%. The same classification task was also undertaken automatically using four machine learning algorithms: a convolutional neural network, support vector machines, the extreme gradient boosting framework, and logistic regression. The machine learning algorithms substantially outperformed the human listeners, with the classification accuracy reaching 84% when tested under binaural-room-impulse-response (BRIR) matched conditions. However, when the algorithms were tested under the BRIR mismatched scenario, their accuracy was comparable to that exhibited by the listeners who passed the post-screening test, implying that the machine learning algorithms' capability to perform in unknown electro-acoustic conditions needs to be further improved.
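The binaural synthesis step described above (convolving multi-track recordings with HRTF sets) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the stems are random noise and the two-tap "HRIR" pairs are toy placeholders, where a real implementation would load measured head-related impulse responses from an HRTF database.

```python
import numpy as np

def binaural_mix(sources, hrirs):
    """Convolve each mono source with its (left, right) HRIR pair and sum.

    sources : list of 1-D float arrays (mono stems)
    hrirs   : list of (hrir_left, hrir_right) pairs, one per source
    Returns an (n_samples, 2) binaural mixture.
    """
    n = max(len(s) + len(hl) - 1 for s, (hl, _) in zip(sources, hrirs))
    mix = np.zeros((n, 2))
    for s, (hl, hr) in zip(sources, hrirs):
        left = np.convolve(s, hl)    # left-ear rendering of this stem
        right = np.convolve(s, hr)   # right-ear rendering of this stem
        mix[:len(left), 0] += left
        mix[:len(right), 1] += right
    return mix

# Illustrative stand-ins: three 1-s noise stems at 48 kHz and toy
# 2-tap impulse-response pairs (hypothetical, not measured HRIRs).
rng = np.random.default_rng(0)
stems = [rng.standard_normal(48000) for _ in range(3)]
toy_hrirs = [(np.array([1.0, 0.3]), np.array([0.3, 1.0]))] * 3
mix = binaural_mix(stems, toy_hrirs)
print(mix.shape)  # (48001, 2)
```

Repeating the mix for each ensemble position (front vs. back HRIR pairs) and each HRTF set is what yields a labelled corpus of binaural excerpts for training.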
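A rough sketch of the automatic classification setup is given below, under loose assumptions: synthetic feature vectors stand in for the spectral features extracted from the binaural excerpts, the three classes stand in for the front/back/surround scenes, and only two of the paper's four algorithms (a support vector machine and logistic regression) are shown.

```python
# Hypothetical stand-in data; the paper's actual features and
# train/test protocol (BRIR matched vs. mismatched) are not reproduced.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Three classes as placeholders for the three spatial audio scenes.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

for model in (SVC(kernel="rbf"), LogisticRegression(max_iter=1000)):
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(model).__name__, round(acc, 3))
```

Comparing such scores against the human baseline (42.5%, or roughly 60% after post-screening) is the essence of the evaluation described above.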