CASA differs from the field of blind signal separation in that it is (at least to some extent) based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment.
Neurons in higher auditory pathway centers are tuned to specific stimulus features, such as periodicity, sound intensity, and amplitude and frequency modulation.
Studies have found that the segregation and grouping operations of ASA are impaired in patients with Alzheimer's disease.
Mimicking the components of the outer and middle ear, the first stage breaks the signal into different frequency bands, modeling the frequency selectivity of the cochlea and hair cells.
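This peripheral stage is commonly approximated with a bank of bandpass filters (many CASA systems use a gammatone filterbank). A minimal sketch using simple FFT-domain masking instead of gammatone filters, with hypothetical band edges:

```python
import numpy as np

def bandpass_filterbank(signal, fs, band_edges):
    """Split a signal into frequency bands by masking the spectrum,
    loosely mimicking the cochlea's frequency selectivity.
    band_edges: list of (low, high) tuples in Hz (hypothetical choices)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    bands = []
    for lo, hi in band_edges:
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spectrum * mask, n=len(signal)))
    return bands

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)
bands = bandpass_filterbank(x, fs, [(100, 1000), (1000, 4000)])
# the low band retains the 440 Hz tone, the high band the 2000 Hz tone
```

A real gammatone filterbank additionally shapes each band with a bell-like frequency response and spaces the channels on an auditory (ERB) scale; the hard spectral mask above is only the simplest stand-in.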
This model replicates many properties of the auditory nerve response, such as rectification, compression, spontaneous firing, and adaptation.
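A crude sketch of two of these nerve-response properties, rectification and compression (the power-law exponent is a hypothetical choice, not a value from any specific hair-cell model):

```python
import numpy as np

def hair_cell(signal, exponent=0.3):
    """Crude inner-hair-cell stage: half-wave rectification followed by
    power-law compression (exponent is a hypothetical parameter)."""
    rectified = np.maximum(signal, 0.0)   # rectification: negative half-cycles zeroed
    return rectified ** exponent          # compression: boosts weak, flattens strong

t = np.linspace(0, 1, 1000)
y = hair_cell(np.sin(2 * np.pi * t))
# output is non-negative, and quiet inputs are amplified relative to loud ones
```

Full models such as Meddis's add the dynamic effects listed above (spontaneous firing and adaptation) via a transmitter-reservoir simulation; the static nonlinearity here only captures the instantaneous part.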
[1] By pooling the autocorrelation across frequency, the position of peaks in the summary correlogram corresponds to the perceived pitch.
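A minimal sketch of the summary correlogram on a synthetic harmonic signal (the two-channel split and the pitch search range are hypothetical simplifications):

```python
import numpy as np

def summary_correlogram_pitch(bands, fs, min_f=80, max_f=400):
    """Pool per-channel autocorrelations and read the pitch from the
    position of the largest peak in the summary (search range hypothetical)."""
    n = len(bands[0])
    summary = np.zeros(n)
    for band in bands:
        # autocorrelation via the Wiener-Khinchin theorem (zero-padded FFT)
        spec = np.abs(np.fft.rfft(band, n=2 * n)) ** 2
        summary += np.fft.irfft(spec)[:n]
    lo, hi = int(fs / max_f), int(fs / min_f)
    lag = lo + np.argmax(summary[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(fs // 4) / fs
# a 200 Hz harmonic complex split across two crude "channels"
bands = [np.sin(2 * np.pi * 200 * t),
         np.sin(2 * np.pi * 400 * t) + np.sin(2 * np.pi * 600 * t)]
pitch = summary_correlogram_pitch(bands, fs)
# the pooled peak sits at the common period, lag 80 samples = 200 Hz
```

Each channel peaks at lags where its own partials repeat, but only the fundamental period (1/200 s) aligns across all channels, so pooling makes the pitch peak dominate.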
[6] By cross-correlating the delayed left and right channels of the model, coincident peaks can be attributed to the same localized sound, regardless of their temporal position in the input signal.
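The interaural delay can be read off as the lag of the cross-correlation peak. A minimal sketch, assuming an idealized pure delay between the two ears:

```python
import numpy as np

def estimate_delay(left, right):
    """Delay of `right` relative to `left`, in samples, taken as the lag
    of the cross-correlation peak (the coincidence-detector idea)."""
    corr = np.correlate(right, left, mode="full")
    return int(np.argmax(corr)) - (len(left) - 1)

rng = np.random.default_rng(0)
left = rng.standard_normal(1000)
right = np.roll(left, 3)   # the right ear hears the source 3 samples later
d = estimate_delay(left, right)
# d recovers the 3-sample interaural time difference
```

In a full binaural model this cross-correlation is computed per frequency channel, and the resulting delay pattern is what gets matched against the coincidence-detector array.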
[1] The use of an interaural cross-correlation mechanism is supported by physiological studies, which find a parallel arrangement of neurons in the auditory midbrain.
Von der Malsburg and Schneider proposed a neural network model with oscillators representing the features of different streams: oscillators belonging to the same stream synchronize, while oscillators of different streams desynchronize.
[13] Wang also presented a model using a network of excitatory units with a global inhibitor, with delay lines to represent the auditory scene within the time-frequency plane.
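As a highly simplified illustration of grouping by synchrony (a Kuramoto-style phase model, not Wang's actual excitatory/inhibitory network, with all parameters hypothetical), oscillators coupled within a stream phase-lock while uncoupled streams evolve independently:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
phase = rng.uniform(0, 2 * np.pi, n)          # random initial phases
group = np.array([0, 0, 0, 1, 1, 1])          # two auditory "streams"

# positive coupling only between oscillators of the same stream
coupling = (group[:, None] == group[None, :]).astype(float)
np.fill_diagonal(coupling, 0.0)

dt, k = 0.05, 2.0
for _ in range(400):
    # each unit is pulled toward the phases of the units in its own group
    diffs = np.sin(phase[None, :] - phase[:, None])
    phase += dt * k * np.sum(coupling * diffs, axis=1)

# phase spread within each stream after settling (wrapped to [-pi, pi])
spread0 = np.ptp(np.angle(np.exp(1j * (phase[group == 0] - phase[0]))))
spread1 = np.ptp(np.angle(np.exp(1j * (phase[group == 1] - phase[3]))))
# both spreads collapse toward zero: each stream's units have synchronized
```

Wang's network adds a global inhibitor that actively pushes different streams apart in phase; here separation between the groups arises only from their independent initial conditions.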
[16] Instead of breaking the audio signal down into individual constituents, the input is described by higher-level descriptors, such as chords, bass and melody, beat structure, and chorus and phrase repetitions.
[17] Chord detection can be implemented through pattern recognition, by extracting low-level features describing harmonic content.
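One common low-level harmonic feature is a 12-bin chroma vector, which can be matched against chord templates. A minimal sketch on a synthesized triad (the analysis frequency range and the two-template dictionary are hypothetical simplifications):

```python
import numpy as np

def chroma(signal, fs):
    """Fold spectral magnitude into 12 pitch classes (0 = A); the
    55-1760 Hz analysis range is a hypothetical choice."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    bins = np.zeros(12)
    for f, mag in zip(freqs, spec):
        if 55 <= f <= 1760:
            pc = int(round(12 * np.log2(f / 440.0))) % 12
            bins[pc] += mag
    return bins / bins.max()

# binary chord templates as pitch-class offsets from the root A
templates = {"A major": [0, 4, 7], "A minor": [0, 3, 7]}

fs = 8000
t = np.arange(fs) / fs
# synthesize an A-major triad: A4, C#5, E5
chord = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 554.37, 659.25))
c = chroma(chord, fs)
scores = {name: sum(c[p] for p in pcs) for name, pcs in templates.items()}
best = max(scores, key=scores.get)
# the major template matches all three strong chroma bins
```

Practical systems use all 24 major/minor templates (one per root), smooth the chroma over time, and often weight the templates, but the matching principle is the same.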
[18] The techniques utilized in music scene analysis can also be applied to speech recognition, and other environmental sounds.
In hierarchical coding, many cells combine to encode all possible combinations of features and objects in the auditory scene.