Perceptual-based 3D sound localization

Human listeners combine information from two ears to localize and separate sound sources originating in different locations in a process called binaural hearing.

The powerful signal processing methods found in the neural systems and brains of humans and other animals are flexible, environmentally adaptable,[1] and take place rapidly and seemingly without effort.[2]

Emulating the mechanisms of binaural hearing can improve recognition accuracy and signal separation in digital signal processing (DSP) algorithms, especially in noisy environments.

Each sound source can be tracked through probabilistic temporal integration, based on data obtained from a microphone array and a particle filtering tracker.
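
The sketch below illustrates one way such a tracker could work: a bootstrap particle filter that maintains azimuth hypotheses and fuses noisy per-frame direction cues over time. The random-walk motion model, Gaussian observation model, and all parameter values are illustrative assumptions, not a specific published system.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 500            # number of azimuth hypotheses (particles)
MOTION_STD = 2.0   # deg/frame, assumed random-walk source motion
OBS_STD = 10.0     # deg, assumed noise on per-frame azimuth cues

def track_step(particles, weights, observed_azimuth):
    """One predict/update/resample cycle of a bootstrap particle filter."""
    # Predict: propagate each azimuth hypothesis through the motion model.
    particles = particles + rng.normal(0.0, MOTION_STD, particles.size)
    # Update: reweight by the Gaussian likelihood of the observed cue.
    weights = weights * np.exp(-0.5 * ((observed_azimuth - particles) / OBS_STD) ** 2)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < particles.size / 2:
        idx = rng.choice(particles.size, particles.size, p=weights)
        particles = particles[idx]
        weights = np.full(particles.size, 1.0 / particles.size)
    return particles, weights

# Usage: track a source drifting from 0 to 30 degrees using noisy cues.
particles = rng.uniform(-90.0, 90.0, N)
weights = np.full(N, 1.0 / N)
for true_az in np.linspace(0.0, 30.0, 50):
    cue = true_az + rng.normal(0.0, OBS_STD)
    particles, weights = track_step(particles, weights, cue)
estimate = float(np.sum(weights * particles))  # posterior-mean azimuth
```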

These approaches can be applied to selective reconstruction of spatialized signals, where spectrotemporal components believed to be dominated by the desired sound source are identified and isolated through the short-time Fourier transform (STFT).
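
A minimal sketch of such selective reconstruction: keep only the STFT bins believed to belong to the target source and invert back to a waveform. The binary mask here is built from an oracle energy-dominance test purely for illustration; a real system would derive the mask from localization cues instead.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)                 # stand-in for the desired source
noise = 0.5 * np.random.default_rng(0).standard_normal(fs)
mixture = target + noise

f, frames, Z_mix = stft(mixture, fs=fs, nperseg=512)
_, _, Z_tgt = stft(target, fs=fs, nperseg=512)       # oracle reference, illustration only

# Keep spectrotemporal bins where the target dominates the mixture energy.
mask = (np.abs(Z_tgt) ** 2) > 0.5 * (np.abs(Z_mix) ** 2)
_, reconstructed = istft(Z_mix * mask, fs=fs, nperseg=512)
```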

Modern systems typically compute the STFT of the incoming signal from two or more microphones and estimate the interaural time difference (ITD) of each spectrotemporal component by comparing the phases of the STFTs.[1]
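
The following sketch shows this phase-comparison step for a two-microphone case: the angle of the per-bin cross-spectrum gives the interchannel phase difference, which divided by angular frequency yields a per-bin delay estimate. The synthetic signals, sample rate, and FFT size are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(1)
left = rng.standard_normal(fs)
delay = 8                                   # 0.5 ms interchannel delay, for illustration
right = np.concatenate([np.zeros(delay), left[:-delay]])

f, frames, Z_l = stft(left, fs=fs, nperseg=512)
_, _, Z_r = stft(right, fs=fs, nperseg=512)

# Phase of the cross-spectrum = interchannel phase difference per bin.
phase_diff = np.angle(Z_l * np.conj(Z_r))

# Convert phase to time delay; skip the DC bin to avoid division by zero.
# Note: phase wraps to (-pi, pi], so per-bin estimates become ambiguous
# above 1/(2*delay) Hz (here ~1 kHz).
itd = np.full_like(phase_diff, np.nan)
nonzero = f > 0
itd[nonzero, :] = phase_diff[nonzero, :] / (2 * np.pi * f[nonzero, None])
```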

Another advantage is that the ITD is relatively strong and easy to obtain without biomimetic instruments such as dummy heads and artificial pinnae, though these may still be used to enhance amplitude disparities.

Interaural level differences (ILDs) provide salient cues for localizing high-frequency sounds in space, and populations of neurons sensitive to ILD are found at almost every synaptic level from the brain stem to the cortex.

Interaural time and level differences play a role in azimuth perception but cannot explain vertical localization.
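
As a counterpart to the ITD sketch above, a per-bin ILD can be read off as the level ratio in decibels between the two channels' STFT magnitudes. This is a minimal sketch; the 6 dB channel difference and all parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(2)
left = rng.standard_normal(fs)
right = 0.5 * left                          # about 6 dB level difference, for illustration

f, frames, Z_l = stft(left, fs=fs, nperseg=512)
_, _, Z_r = stft(right, fs=fs, nperseg=512)

# Per-bin level ratio in dB; epsilon guards against log of zero.
eps = 1e-12
ild_db = 20.0 * np.log10((np.abs(Z_l) + eps) / (np.abs(Z_r) + eps))
```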

Processing based on the precedence effect involves dividing the signal into frequency bands via bandpass filtering and then enhancing the leading edge of the sound envelope in each band.[10]
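
A minimal sketch of that pipeline, under assumed band edges and an assumed onset-emphasis rule: split the signal into bandpass channels, extract each band's amplitude envelope, and emphasize samples where the envelope is rising (the leading edge).

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

fs = 16000
rng = np.random.default_rng(3)
x = rng.standard_normal(fs)

band_edges = [(100, 500), (500, 1500), (1500, 4000)]  # Hz, assumed bands
enhanced_bands = []
for lo, hi in band_edges:
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, x)
    envelope = np.abs(hilbert(band))                  # amplitude envelope
    # Leading edge: half-wave-rectified derivative of the envelope.
    onset = np.maximum(np.diff(envelope, prepend=envelope[0]), 0.0)
    # Weight the band so rising-envelope (onset) regions dominate.
    enhanced_bands.append(band * (1.0 + onset / (envelope + 1e-12)))

enhanced = np.sum(enhanced_bands, axis=0)
```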

The average human can locate a sound source with better than 5° accuracy in both azimuth and elevation, even in challenging environments.