[2] The goal of audio inpainting is to fill in the gaps (i.e., the missing portions) in the audio signal seamlessly, making the reconstructed portions indistinguishable from the original content and avoiding the introduction of audible distortions or alterations.
Classic methods employ statistical models or digital signal processing algorithms [1][4][5] to predict and synthesize the missing or damaged sections.
Recent solutions instead take advantage of deep learning models, following the growing trend of data-driven methods in the context of audio restoration.
In long inpainting, instead, with gaps on the order of hundreds of milliseconds or even seconds, this goal becomes unrealistic, since restoration techniques cannot rely on local information alone.
[3] The case of medium-duration gaps lies between short and long inpainting.
It refers to the reconstruction of tens of milliseconds of missing data, a scale at which the non-stationary nature of audio already becomes significant.
The first term is a distance measure that quantifies the reconstruction accuracy between the corrupted audio signal and the estimated one.
It is thus necessary to add a constraint to the minimization, in order to restrict the results to valid solutions only.
The regularization term can express assumptions on the stationarity of the signal or on the sparsity of its representation, or it can be learned from data.
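The minimization described above can be sketched as follows. The notation here is illustrative, since the original equation is not reproduced in this excerpt: $y$ denotes the corrupted observation, $d(\cdot,\cdot)$ the distance term evaluated on the reliable samples, $R(\cdot)$ the regularization expressing the prior, and $\lambda$ a weighting factor.

```latex
\hat{x} = \arg\min_{x} \; d\left(y, x\right) + \lambda \, R(x)
```

The distance term enforces fidelity to the uncorrupted samples, while the regularization term selects, among all signals consistent with them, the one that best matches the assumed prior.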
These assumptions can vary significantly, influenced by factors such as the specific application requirements, the length of the gaps, and the available data.
[2] Model-based techniques involve the exploitation of mathematical models or assumptions about the underlying structure of the audio signal.
These models can be based on prior knowledge of the audio content or statistical properties observed in the data.
By leveraging these models, missing or corrupted portions of the audio signal can be inferred or estimated.
[5][13] Some more recent techniques approach audio inpainting by representing audio signals as sparse linear combinations of a limited number of basis functions (as, for example, in the Short-Time Fourier Transform).
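The sparse-representation idea can be illustrated with a minimal sketch. This is not any specific published method: it uses a naive DFT in place of the STFT, and simple hard thresholding, to show how a frame of audio built from a few sinusoids is captured by only a handful of basis coefficients.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) DFT: N real samples -> N complex coefficients."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def sparse_approx(frame, k_keep):
    """Keep only the k_keep largest-magnitude coefficients (hard thresholding)."""
    coeffs = dft(frame)
    keep = set(sorted(range(len(coeffs)),
                      key=lambda k: abs(coeffs[k]), reverse=True)[:k_keep])
    sparse = [c if k in keep else 0j for k, c in enumerate(coeffs)]
    return idft(sparse)

# A frame made of two sinusoids at integer bins is exactly 4-sparse in the
# DFT basis (each real sinusoid occupies two conjugate bins).
n = 64
frame = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
         for t in range(n)]
approx = sparse_approx(frame, 4)
err = max(abs(a - b) for a, b in zip(frame, approx))  # near machine precision
```

In a real sparsity-based inpainting method, the coefficients of the missing region are estimated by requiring the sparse reconstruction to agree with the reliable samples surrounding the gap.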
[2] As a way to overcome these limitations, some approaches also add strong assumptions about the fundamental structure of the missing content itself, exploiting sinusoidal modeling [16] or similarity graphs [8] to perform inpainting of longer missing portions of audio signals.
Once trained, these models can be used to generate missing portions of the audio signal based on the learned representations, without being restricted by stationarity assumptions.
[3] Data-driven techniques also offer the advantage of adaptability and flexibility, as they can learn from diverse audio datasets and potentially handle complex inpainting scenarios.
[3] As of today, such techniques constitute the state-of-the-art of audio inpainting, being able to reconstruct gaps of hundreds of milliseconds or even seconds.
[17] In GAN-based inpainting methods, the generator acts as a context encoder and produces a plausible completion for the gap given only the available information surrounding it.
[3] The discriminator is used to train the generator and tests the consistency of the produced inpainted audio.
For this reason, they have also been applied to the audio inpainting problem, obtaining valid results.
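The data flow of such a context-based approach can be sketched as follows. This is a toy illustration, not a real GAN: the trained generator network is replaced by a simple linear interpolation across the gap, while the masking and compositing steps mirror the structure described above.

```python
def make_context(signal, gap_start, gap_len):
    """Mask the gap: in a real system, this masked signal is what the
    context-encoder generator receives as input."""
    ctx = list(signal)
    ctx[gap_start:gap_start + gap_len] = [0.0] * gap_len
    return ctx

def toy_generator(context, gap_start, gap_len):
    """Stand-in for the trained generator network: here it simply
    interpolates linearly between the samples bordering the gap."""
    left = context[gap_start - 1]
    right = context[gap_start + gap_len]
    return [left + (right - left) * (i + 1) / (gap_len + 1)
            for i in range(gap_len)]

def inpaint(signal, gap_start, gap_len):
    """Composite: reliable samples are kept, generated ones fill the gap."""
    ctx = make_context(signal, gap_start, gap_len)
    fill = toy_generator(ctx, gap_start, gap_len)
    out = list(signal)
    out[gap_start:gap_start + gap_len] = fill
    return out

corrupted = [0.0, 0.1, 0.2, -1.0, -1.0, 0.5, 0.6]  # samples 3-4 are corrupted
restored = inpaint(corrupted, 3, 2)
```

In the GAN setting, the interpolation above is replaced by a neural network whose output is judged by the discriminator, which pushes the completion toward perceptual consistency with the surrounding context rather than mere smoothness.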
[2] One drawback of generative models is that they typically need a huge amount of training data.
[6] Nonetheless, some works demonstrated that capturing the essence of an audio signal is possible using only a few tens of seconds from a single training sample.
[6][18][19] This is done by overfitting a generative neural network to a single training audio signal.
This approach can also be employed to recover deteriorated old recordings that have been affected by local modifications or have missing audio samples due to scratches on CDs.
[2] Audio inpainting is also closely related to packet loss concealment (PLC).
In the PLC problem, it is necessary to compensate for the loss of audio packets in communication networks.
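A minimal sketch of a classic low-complexity concealment strategy, waveform repetition, illustrates the connection: each lost packet is replaced with a copy of the last correctly received one. This is only one of many PLC techniques; practical systems typically add cross-fading and pitch-aware repetition to avoid audible discontinuities.

```python
def conceal(packets, packet_len=4):
    """Replace each lost packet (marked None) with a copy of the last
    correctly received one; emit silence if no packet has arrived yet.
    packet_len is an illustrative fixed packet size."""
    out = []
    last = None
    for p in packets:
        if p is None:  # packet lost in transit
            p = last if last is not None else [0.0] * packet_len
        out.append(list(p))
        last = p
    return out

# One packet lost mid-stream: it is concealed by repeating its predecessor.
stream = [[0.1, 0.2, 0.3, 0.4], None, [0.5, 0.6, 0.7, 0.8]]
healed = conceal(stream)
```

Seen this way, PLC is an online, low-latency variant of audio inpainting: the gap must be filled using past context only, since future packets are not yet available.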