Mel-frequency cepstrum

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC.[1] They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum").

The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal spectrum.

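MFCCs are commonly derived as follows:[2][3]

1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using overlapping windows (typically triangular).
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.

There can be variations on this process, for example differences in the shape or spacing of the windows used to map the scale,[4] or the addition of dynamics features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients.[5]

The European Telecommunications Standards Institute in the early 2000s defined a standardised MFCC algorithm to be used in mobile phones.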
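The usual derivation (windowed Fourier transform, mel filter bank, logarithm, discrete cosine transform) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the mel formula, the frame length, hop size, filter count and the function names `mel_filterbank`, `dct_ii`, `mfcc` and `delta` are all assumptions chosen for the example, and the simple two-point `delta` is only one of several ways to compute dynamics features.

```python
import numpy as np


def hz_to_mel(f):
    # Assumed mel formula (several variants exist): 1000 * log2(1 + f / 1000).
    return 1000.0 * np.log2(1.0 + np.asarray(f, dtype=float) / 1000.0)


def mel_to_hz(m):
    return 1000.0 * (2.0 ** (np.asarray(m, dtype=float) / 1000.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


def dct_ii(x):
    """Unnormalised DCT-II of a 1-D array, written out explicitly."""
    n = x.size
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n) @ x


def mfcc(signal, sample_rate, frame_len=400, hop=160, n_filters=26, n_coeffs=13):
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, frame_len, sample_rate)
    out = np.empty((n_frames, n_coeffs))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2   # power spectrum of the windowed frame
        mel_power = fbank @ power                 # map the powers onto the mel scale
        log_mel = np.log(mel_power + 1e-10)       # log of each mel-band power
        out[t] = dct_ii(log_mel)[:n_coeffs]       # DCT; keep the first coefficients
    return out


def delta(features):
    """First-order frame-to-frame difference (simple two-point version)."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0


# Usage: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
features = mfcc(tone, sr)
print(features.shape, delta(features).shape)
```

Applying `delta` twice gives a "delta-delta" feature in the same spirit. Production systems usually add details omitted here, such as pre-emphasis and liftering.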

MFCCs are also increasingly finding uses in music information retrieval applications such as genre classification, audio similarity measures, etc.

MFCCs can also be used to recognize the mobile phone on which speech was recorded.[4] This type of mobile device recognition is possible because the electronic components in a phone are manufactured with tolerances, so different realizations of the same circuit do not have exactly the same transfer function.

The dissimilarities in the transfer function from one realization to another become more prominent if the circuits performing the task are from different manufacturers.

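Each phone therefore leaves its own imprint on the recorded speech: in the frequency domain, the original speech spectrum is multiplied by the transfer functions specific to that phone. A particular phone can then be identified from the recorded speech by applying signal processing techniques that expose this device-specific factor.[5]

Consider the recording section of a cell phone as a linear time-invariant (LTI) filter with impulse response h(n), so that the recorded speech signal y(n) is the output of the filter in response to the input x(n):

y(n) = x(n) * h(n)

where * denotes convolution. The embedded identity of the cell phone requires conversion to a more identifiable form, so the short-time Fourier transform of each frame is taken, which turns the convolution into a product:

Y(f) = X(f) H(f)

Writing the speech spectrum as an excitation Xe(f) shaped by the vocal tract transfer function Hv(f), that is X(f) = Xe(f) Hv(f), gives

Y(f) = Xe(f) H'(f)

where H'(f) = Hv(f) H(f) is the equivalent transfer function that characterizes the cell phone.

Applying the mel filter bank, with transfer function U(f), and taking the logarithm of the magnitude turns the product into a sum. With Hw(f) = U(f) H'(f):

log|Y(f)| = log|Xe(f)| + log|Hw(f)|

MFCC is successful because of this nonlinear transformation with additive property.

Transforming back to the time domain:

cy(j) = ce(j) + cw(j)

where cy(j), ce(j) and cw(j) are the recorded speech cepstrum, the excitation cepstrum and the weighted equivalent impulse response of the cell phone recorder that characterizes the cell phone, respectively, while j indexes the filters in the filter bank.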

More precisely, the device-specific information in the recorded speech is converted to an additive form suitable for identification.
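The additive property can be checked numerically. The sketch below is an illustration under stated assumptions, not the method of any cited work: it uses a plain real cepstrum (inverse FFT of the log magnitude spectrum) without a mel filter bank, a random sequence as a stand-in for a speech frame, a made-up four-tap impulse response for the "device", and circular convolution so that the product relation holds exactly for the discrete Fourier transform.

```python
import numpy as np


def real_cepstrum(x):
    """Inverse FFT of the log magnitude spectrum of x."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)
    return np.fft.ifft(log_mag).real


rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)          # stand-in for one frame of input speech
h = np.zeros(n)
h[:4] = [1.0, 0.5, 0.25, 0.125]     # hypothetical device impulse response

# Circular convolution, so that Y(f) = X(f) H(f) holds exactly for the DFT.
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

c_x, c_h, c_y = real_cepstrum(x), real_cepstrum(h), real_cepstrum(y)

# The cepstrum of the recording is the sum of the two component cepstra.
print(np.allclose(c_y, c_x + c_h, atol=1e-6))
```

Because the device contribution enters as an added term, it can in principle be separated from the speech contribution by linear operations such as averaging over many frames.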

The mel scale is a commonly used frequency scale that is approximately linear up to 1000 Hz and logarithmic above it.

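The central frequencies of the filters in the filter bank are computed on the mel scale. In the basic procedure for MFCC calculation, the logarithm of the filter bank outputs is taken and the discrete cosine transform is applied to the result to obtain the cepstral coefficients.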
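As an illustration of how filter centre frequencies can be placed on a mel scale, the sketch below uses one closed-form mel approximation, mel(f) = 1000 log2(1 + f/1000), which maps 1000 Hz to 1000 mel. The choice of this formula is an assumption (other variants are in common use and the text here does not fix one), and `filter_centre_frequencies` is a hypothetical helper name.

```python
import math


def hz_to_mel(f_hz):
    """Assumed mel formula: 1000 * log2(1 + f / 1000)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 1000.0 * (2.0 ** (mel / 1000.0) - 1.0)


def filter_centre_frequencies(n_filters, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) of n_filters bands equally spaced in mel."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]


centres = filter_centre_frequencies(n_filters=10, f_min_hz=0.0, f_max_hz=8000.0)
print([round(c) for c in centres])
```

The centres are evenly spaced in mel, so in hertz they crowd together at low frequencies and spread out at high frequencies.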

An MFCC can be approximately inverted to audio in four steps: (a1) inverse DCT to obtain a mel log-power [dB] spectrogram, (a2) mapping to power to obtain a mel power spectrogram, (b1) rescaling to obtain short-time Fourier transform magnitudes, and finally (b2) phase reconstruction and audio synthesis using Griffin-Lim.[9]

MFCC values are not very robust in the presence of additive noise, and so it is common to normalise their values in speech recognition systems to lessen the influence of noise.
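The text does not say which normalisation it has in mind. One widely used scheme is cepstral mean and variance normalisation, sketched below as an assumption: each coefficient is shifted to zero mean and scaled to unit variance over the frames of an utterance. The function name `cmvn` and the random stand-in features are illustrative only.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Per-coefficient mean and variance normalisation over one utterance.

    features: array of shape (n_frames, n_coeffs).
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 13)) * 3.0 + 5.0   # stand-in for MFCC frames
offset = rng.standard_normal((1, 13))               # constant per-coefficient shift

normalised = cmvn(feats)
# A constant offset added to every frame is removed by the mean subtraction.
print(np.allclose(cmvn(feats + offset), normalised))
```

Mean subtraction removes any term that is constant across frames, which is why it is often paired with cepstral features: a fixed channel or device response appears in the cepstrum as such a constant term.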

Some researchers propose modifications to the basic MFCC algorithm to improve robustness, such as by raising the log-mel-amplitudes to a suitable power (around 2 or 3) before taking the discrete cosine transform (DCT), which reduces the influence of low-energy components.[10]

Paul Mermelstein[11][12] is typically credited with the development of the MFC.

Mermelstein credits Bridle and Brown[13] for the idea: Bridle and Brown used a set of 19 weighted spectrum-shape coefficients given by the cosine transform of the outputs of a set of nonuniformly spaced bandpass filters.[14]

Many authors, including Davis and Mermelstein,[12] have commented that the spectral basis functions of the cosine transform in the MFC are very similar to the principal components of the log spectra, which were applied to speech representation and recognition much earlier by Pols and his colleagues.