Audio mining

[4] The audio will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content.

One or more audio mining index files can then be loaded at a later date in order to run searches for keywords or phrases.

The results of a search will normally be in terms of hits, which are regions within files that are good matches for the chosen keywords.

The user may then be able to listen to the audio corresponding to these hits in order to verify whether a correct match was found.
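As a rough illustration of what a hit might look like in practice, the following minimal sketch models a hit as a file name, a time region and a match score, and ranks hits so that the best candidates can be auditioned first. The `Hit` class, its field names, the 0 to 1 score scale and the `min_score` threshold are all assumptions made for this example; they do not describe any particular audio mining product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result: a region of an audio file (hypothetical layout)."""
    file: str
    start_s: float  # region start, in seconds
    end_s: float    # region end, in seconds
    score: float    # match confidence; the 0..1 scale is an assumption


def top_hits(hits, min_score=0.5):
    """Keep hits scoring at least min_score, best match first."""
    kept = [h for h in hits if h.score >= min_score]
    return sorted(kept, key=lambda h: h.score, reverse=True)


if __name__ == "__main__":
    results = [
        Hit("a.wav", 1.0, 1.5, 0.4),
        Hit("b.wav", 3.0, 3.6, 0.9),
        Hit("a.wav", 7.0, 7.4, 0.7),
    ]
    for hit in top_hits(results):
        print(f"{hit.file} {hit.start_s:.1f}-{hit.end_s:.1f}s score={hit.score}")
```

A listener would then play back each region in ranked order to confirm or reject the match.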

The main problem in audio is one of information retrieval: there is a need to locate the documents that contain the search key.

In text-based indexing or large vocabulary continuous speech recognition (LVCSR), the audio file is first broken down into recognizable phonemes.

It is then run through a dictionary that can contain several hundred thousand entries and matched with words and phrases to produce a full text transcript.

A user can then simply search for a desired word or term, and the relevant portion of the audio content will be returned.

[6] Although the initial processing of audio takes a fair amount of time, searching is quick because only simple text-to-text matching is needed.
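The trade-off between slow indexing and fast searching can be sketched with a small inverted index built over a time-stamped transcript. This is only an illustration under stated assumptions: the transcript is taken to be already produced by a recognizer as `(word, start, end)` tuples, and the function names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each lower-cased word to the regions where it was recognized.

    transcripts: {file_name: [(word, start_s, end_s), ...]}
    (assumed output format of a speech recognizer, for illustration only)
    """
    index = defaultdict(list)
    for file_name, words in transcripts.items():
        for word, start_s, end_s in words:
            index[word.lower()].append((file_name, start_s, end_s))
    return index


def search(index, term):
    """Exact text-to-text lookup; words absent from the transcript yield no hits."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    transcripts = {
        "call1.wav": [("hello", 0.0, 0.4), ("world", 0.5, 0.9)],
        "call2.wav": [("Hello", 2.0, 2.3)],
    }
    index = build_index(transcripts)
    print(search(index, "hello"))
    print(search(index, "unlisted"))
```

The costly step (recognition and index construction) happens once, while each query is a single dictionary lookup. The second query also shows the weakness of the approach: a word that never made it into the transcript cannot be found.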

The inherently random nature of audio and the problem of external noise both affect the accuracy of text-based indexing.

Audio mining systems try to cope with out-of-vocabulary (OOV) words by continuously updating the dictionary and language model used, but the problem remains significant and has prompted a search for alternatives.

[7] Additionally, due to the need to constantly update and maintain task-based knowledge and large training databases to cope with the OOV problem, high computational costs are incurred.

Then, multiple PAT files can be scanned at high speed during a single search for likely phonetic sequences that closely match corresponding strings of phonemes in the query term.
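The idea of scanning for phonetic sequences that closely match the query can be sketched as approximate matching over a phoneme stream. The example below slides a window across a list of phoneme symbols and keeps the windows whose edit distance to the query is small. It is a simplified stand-in, not the algorithm of any real phonetic search engine: the phoneme symbols, the fixed window length and the `max_dist` threshold are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def phonetic_search(track, query, max_dist=1):
    """Return (start_index, distance) for each query-length window within max_dist."""
    n = len(query)
    hits = []
    for start in range(len(track) - n + 1):
        dist = edit_distance(track[start:start + n], query)
        if dist <= max_dist:
            hits.append((start, dist))
    return hits


if __name__ == "__main__":
    # Hypothetical phoneme stream for "the cat sat" and a query for "cat".
    track = ["DH", "AH", "K", "AE", "T", "S", "AE", "T"]
    query = ["K", "AE", "T"]
    print(phonetic_search(track, query))
```

Because the comparison tolerates a small number of phoneme differences, near matches are returned alongside exact ones, which is why such searches produce candidate hits that a listener may still need to verify.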

[8][9] Phonetic indexing is most attractive as it is largely unaffected by linguistic issues such as unrecognized words and spelling errors.

Unless the system recognizes the entire word exactly, or understands the phonetic sequences of the language, it is difficult for phonetic-based indexing to return accurate findings.

Searches can then be carried out to find pieces of music that are similar in terms of their melodic, harmonic and/or rhythmic characteristics.
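One simple way to picture such a similarity search is to represent each piece of music as a numeric feature vector and rank the library by how close each vector is to the query. The sketch below uses cosine similarity for the comparison. Both the choice of cosine similarity and the toy two-dimensional vectors are assumptions made for illustration; the source does not say which features or distance measures actual systems use.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, library, k=2):
    """Return the names of the k library items most similar to the query vector."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical features, e.g. one melodic and one rhythmic descriptor per track.
    library = {
        "track_a": [1.0, 0.0],
        "track_b": [0.0, 1.0],
        "track_c": [0.9, 0.1],
    }
    print(most_similar([1.0, 0.0], library))
```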

[17] The efficiency of audio mining in processing audio-visual data aids speaker identification and segmentation, as well as text transcription.

Call centers have used the technology to conduct real-time analysis by identifying changes in tone, sentiment or pitch, amongst others, which are then processed by a decision engine or artificial intelligence to take further action.