Since the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind.
The engineer Paris Smaragdis, interviewed in Technology Review, describes these systems as "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."
Computer audition also differs from machine speech understanding in that it deals with general audio signals, such as natural sounds and musical recordings.
Computer audition overlaps with several related disciplines. Since audio signals are interpreted by the human ear–brain system, that complex perceptual mechanism must somehow be simulated in software to achieve "machine listening".
One of the unique properties of musical signals is that they often combine different types of representations, such as graphical scores and sequences of performance actions that are encoded as MIDI files.
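To illustrate how such performance-level representations can be consumed programmatically, the following is a minimal sketch that lists the note events of a MIDI file; it assumes the third-party mido library and a hypothetical file name, neither of which is prescribed by the text above.

```python
# Minimal sketch: reading performance actions from a MIDI file,
# assuming the "mido" library and a hypothetical "performance.mid".
import mido

mid = mido.MidiFile("performance.mid")
for track in mid.tracks:
    time = 0  # accumulated delta time in ticks
    for msg in track:
        time += msg.time
        # note_on messages carry the performance actions: which pitch
        # was played, how hard, and when (relative to the track start).
        if msg.type == "note_on" and msg.velocity > 0:
            print(f"tick {time}: note {msg.note}, velocity {msg.velocity}")
```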
Additional types of data relevant for computer audition are textual descriptions of audio contents, such as annotations and reviews, and visual information in the case of audio-visual recordings.
Other source separation methods rely on training or on clustering of features in mono recordings, for example tracking harmonically related partials for multiple pitch detection.
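One widely used family of such decomposition methods for mono recordings is non-negative matrix factorization (NMF) of the magnitude spectrogram. The sketch below is a self-contained illustration on synthetic data, using the classic Lee–Seung multiplicative updates; NMF is an assumed example here, not the specific method the text refers to.

```python
# Minimal sketch: NMF-based separation of a synthetic mono
# magnitude spectrogram (an illustrative assumption, not a
# prescribed method).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic magnitude spectrogram: 257 frequency bins x 200 frames,
# built from two "sources" with fixed spectra and varying gains.
spectra = np.abs(rng.normal(size=(257, 2)))   # per-source spectra
gains = np.abs(rng.normal(size=(2, 200)))     # per-source activations
V = spectra @ gains + 1e-6                    # observed mixture

# NMF: find W (basis spectra) and H (activations) with V ~ W @ H.
k = 2
W = np.abs(rng.normal(size=(257, k))) + 1e-6
H = np.abs(rng.normal(size=(k, 200))) + 1e-6
for _ in range(200):
    # Multiplicative updates for the Euclidean cost (Lee & Seung).
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Each source estimate is one rank-1 component of the factorization;
# a soft mask applied to the mixture reconstructs its magnitude.
for i in range(k):
    mask = np.outer(W[:, i], H[i]) / (W @ H + 1e-9)
    source_mag = mask * V
    print(f"source {i}: reconstructed energy {source_mag.sum():.1f}")
```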
Some methods, before attempting explicit recognition, rely on revealing structures in the data without knowing what those structures are (much like recognizing objects in abstract pictures without attaching meaningful labels to them). They do this by finding the least complex representations of the data, for instance describing an audio scene as generated by a few tone patterns and their trajectories (polyphonic voices), or by the acoustical contours drawn by a tone (chords).
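To make the idea of a "least complex representation" concrete, here is a minimal sketch that clusters spectral frames into a few prototype patterns and re-describes the signal as a sequence of pattern indices. The use of k-means and all parameters are illustrative assumptions, not the specific method described above.

```python
# Minimal sketch: discovering recurring "tone patterns" without labels
# by clustering short-time spectra, then encoding the scene as the
# sequence of cluster indices (a compact description of the signal).
import numpy as np

rng = np.random.default_rng(1)

# Synthetic feature matrix: 300 frames x 64 spectral features, drawn
# from three underlying patterns plus noise.
patterns = rng.normal(size=(3, 64))
labels_true = rng.integers(0, 3, size=300)
X = patterns[labels_true] + 0.1 * rng.normal(size=(300, 64))

# Plain k-means (Lloyd's algorithm).
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(50):
    # Assign each frame to its nearest pattern.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    # Re-estimate each pattern as the mean of its frames.
    for j in range(k):
        if np.any(assign == j):
            centers[j] = X[assign == j].mean(axis=0)

# The scene is now encoded by k prototype spectra plus one index per
# frame: far fewer numbers than the original feature matrix.
print("first 20 frames as pattern indices:", assign[:20])
```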
This involves analyzing larger-scale repetition and self-similarity structures in audio to detect novelty, as well as the ability to predict local feature dynamics.
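As one concrete way to compute such structures, the sketch below builds a self-similarity matrix over feature frames and derives a simple frame-to-frame novelty score. The cosine-similarity formulation and the synthetic A-B-A signal are assumptions for illustration, not a method specified by the text.

```python
# Minimal sketch: a self-similarity matrix over feature frames, whose
# block and diagonal structure exposes repetition, plus a simple
# novelty score from the dissimilarity of adjacent frames.
import numpy as np

rng = np.random.default_rng(2)

# Synthetic features with an A-B-A structure of 3 x 100 frames.
a = rng.normal(size=64)
b = rng.normal(size=64)
X = np.vstack([np.tile(a, (100, 1)),
               np.tile(b, (100, 1)),
               np.tile(a, (100, 1))]) + 0.1 * rng.normal(size=(300, 64))

# Self-similarity matrix from cosine similarity of normalized frames.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T                   # S[i, j] ~ similarity of frames i, j

# Novelty: low similarity between consecutive frames marks boundaries.
novelty = 1.0 - np.diag(S, k=1)
print("boundary candidates near frames:", np.argsort(novelty)[-2:])
```

On the synthetic A-B-A signal, the two largest novelty values fall at the section boundaries (around frames 99 and 199), which is the repetition structure the self-similarity analysis is meant to expose.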