Neural style transfer

Neural style transfer (NST) refers to a class of software algorithms that manipulate digital images or videos so that they adopt the appearance or visual style of another image.

NST algorithms are characterized by their use of deep neural networks to perform the image transformation.

Several notable mobile apps, including DeepArt and Prisma, use NST techniques for this purpose.

This method has been used by artists and designers around the globe to develop new artwork based on existing styles.

NST is an example of image stylization, a problem studied for over two decades within the field of non-photorealistic rendering. Two earlier example-based style transfer algorithms were image analogies[1] and image quilting.[2] Both of these methods were based on patch-based texture synthesis algorithms.

NST was first published in the paper "A Neural Algorithm of Artistic Style" by Leon Gatys et al., originally released on arXiv in 2015,[3] and subsequently accepted by the peer-reviewed CVPR conference in 2016.[4]

The original paper used a VGG-19 architecture[5] pretrained to perform object recognition using the ImageNet dataset.
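
In this setup the pretrained network serves purely as a fixed feature extractor; its weights are never updated. The following is a minimal PyTorch sketch of that role, in which the torchvision API call and the idea of collecting activations by layer index are illustrative assumptions rather than details from the paper:

    import torch
    from torchvision import models

    # Load a VGG-19 pretrained on ImageNet and keep only the convolutional
    # feature extractor; the classifier head is not needed for NST.
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

    # Freeze the weights: NST optimizes an image, never the network.
    for p in vgg.parameters():
        p.requires_grad_(False)

    def extract_features(image, layer_indices):
        """Collect activations at the given indices of the VGG feature stack.

        image: a (1, 3, H, W) tensor, normalized with the ImageNet statistics.
        Returns a dict mapping layer index -> activation tensor.
        """
        feats = {}
        x = image
        for i, module in enumerate(vgg):
            x = module(x)
            if i in layer_indices:
                feats[i] = x
        return feats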

In 2017, Google AI introduced a method[6] that allows a single deep convolutional style transfer network to learn multiple styles at the same time.

This algorithm permits style interpolation in real time, even when applied to video media.[4]
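
Multi-style networks of this kind are commonly built around conditional instance normalization, in which each style owns one learned scale and shift per channel, and blending those parameters interpolates between styles. A rough PyTorch sketch under that assumption (the class and argument names are invented for illustration):

    import torch
    import torch.nn as nn

    class ConditionalInstanceNorm2d(nn.Module):
        """Instance normalization with one learned (scale, shift) per style.

        A one-hot weight vector applies a single style; soft weights blend
        the learned parameters and thus interpolate between styles.
        """
        def __init__(self, num_features, num_styles):
            super().__init__()
            self.norm = nn.InstanceNorm2d(num_features, affine=False)
            self.gamma = nn.Parameter(torch.ones(num_styles, num_features))
            self.beta = nn.Parameter(torch.zeros(num_styles, num_features))

        def forward(self, x, style_weights):
            # style_weights: (num_styles,) tensor summing to 1.
            g = (style_weights[:, None] * self.gamma).sum(dim=0)
            b = (style_weights[:, None] * self.beta).sum(dim=0)
            return self.norm(x) * g[None, :, None, None] + b[None, :, None, None]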

The idea of Neural Style Transfer (NST) is to take two images, a content image $\vec{p}$ and a style image $\vec{a}$, and generate a third image $\vec{x}$ that reproduces the content of the first in the visual style of the second. This is posed as an optimization over the pixels of $\vec{x}$, minimizing a weighted combination of a content loss and a style loss, $\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{content}} + \beta \mathcal{L}_{\text{style}}$.

The content similarity is the weighted sum of squared differences between the neural activations of a single convolutional neural network (CNN) on the two images.

The style similarity is the weighted sum of squared differences between the Gram matrices of the two images within each layer (see below for details).

An input image is encoded in each layer of the CNN by the filter responses to that image, with higher layers encoding more global features but losing detail in local features.

The content loss is defined as the squared-error loss between the feature representations of the generated image $\vec{x}$ and the content image $\vec{p}$ at a chosen layer $l$:

$$\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

where $F^l_{ij}$ and $P^l_{ij}$ are the activations of the $i$-th filter at position $j$ in layer $l$ for the generated and content images, respectively.

Minimizing this loss encourages the generated image to have similar content to the content image, as captured by the feature activations in the chosen layer.
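
As a sketch, the content loss at one layer is a one-liner on the activation tensors produced by a feature extractor such as the one above (the tensor shapes are an assumption; the 1/2 factor follows the definition):

    def content_loss(gen_feats, content_feats):
        # 1/2 * sum of squared differences between the activations of the
        # generated image and the content image at one chosen layer.
        return 0.5 * ((gen_feats - content_feats) ** 2).sum()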

The style loss is based on the Gram matrices of the generated and style images, which capture the correlations between different filter responses at different layers of the CNN:

$$\mathcal{L}_{\text{style}}(\vec{a}, \vec{x}) = \sum_l w_l E_l, \qquad E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

where $G^l_{ij}$ and $A^l_{ij}$ are the entries of the Gram matrices for the generated and style images at layer $l$, defined by $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$; $N_l$ is the number of filters, $M_l$ is the spatial size of the feature maps at layer $l$, and $w_l$ is the weight given to layer $l$.

Minimizing this loss encourages the generated image to have similar style characteristics to the style image, as captured by the correlations between feature responses in each layer.

The idea is that activation pattern correlations between filters in a single layer capture the "style" at the scale of the receptive fields of that layer.
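
A sketch of the Gram matrix and the resulting style loss, following the definition above (batch size 1 and the per-layer weights are assumptions):

    import torch

    def gram_matrix(feats):
        # feats: (1, C, H, W) activations for a single image. The Gram
        # matrix holds inner products between all pairs of filter maps.
        _, c, h, w = feats.shape
        f = feats.reshape(c, h * w)
        return f @ f.t()

    def style_loss(gen_feats, style_feats, layer_weights):
        # Weighted sum over layers of squared Gram-matrix differences,
        # normalized by 4 * N_l^2 * M_l^2 as in the formula above.
        loss = 0.0
        for w_l, g, s in zip(layer_weights, gen_feats, style_feats):
            _, c, h, w = g.shape
            n_l, m_l = c, h * w
            G, A = gram_matrix(g), gram_matrix(s)
            loss = loss + w_l * ((G - A) ** 2).sum() / (4 * n_l**2 * m_l**2)
        return loss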

The generated image is initially approximated by adding a small amount of white noise to the input content image.

Then the total loss is successively backpropagated through the network, with the CNN weights fixed, in order to update the pixels of the generated image.

As of 2017, when implemented on a GPU, the optimization takes a few minutes to converge.[8]
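
Putting the pieces together, the optimization loop might look like the following sketch. It reuses the hypothetical helpers from above; the layer indices, loss weights, and the use of Adam in place of the L-BFGS optimizer used by Gatys et al. are all simplifying assumptions:

    import torch

    def run_nst(content_img, style_img, steps=300, alpha=1.0, beta=1e4):
        content_layers = [21]                  # illustrative layer choices
        style_layers = [0, 5, 10, 19, 28]

        # Targets are computed once from the fixed network.
        c_target = extract_features(content_img, content_layers)[21]
        s_feats = extract_features(style_img, style_layers)
        s_targets = [s_feats[i] for i in style_layers]

        # Initialize near the content image with a little white noise;
        # the pixels of x are the only optimization variables.
        x = (content_img + 0.1 * torch.randn_like(content_img)).requires_grad_(True)
        optimizer = torch.optim.Adam([x], lr=0.02)

        for _ in range(steps):
            optimizer.zero_grad()
            g_c = extract_features(x, content_layers)[21]
            g_feats = extract_features(x, style_layers)
            g_s = [g_feats[i] for i in style_layers]
            loss = alpha * content_loss(g_c, c_target) \
                 + beta * style_loss(g_s, s_targets, [0.2] * 5)
            loss.backward()        # the gradient reaches the pixels only;
            optimizer.step()       # the CNN weights remain frozen
        return x.detach()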

In some practical implementations, the resulting image is observed to contain too many high-frequency artifacts, which can be suppressed by adding a total variation penalty to the total loss.[9]
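
A common anisotropic form of the total variation penalty, as a sketch (the choice of L1 differences and the weight the term would receive in the total loss are assumptions):

    def total_variation_loss(x):
        # x: (1, C, H, W). Sums absolute differences between neighboring
        # pixels, discouraging high-frequency artifacts.
        dh = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().sum()
        dw = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().sum()
        return dh + dw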

Compared to VGGNet, AlexNet does not work well for neural style transfer.[11]

Subsequent work improved the speed of NST for images by using special-purpose normalizations.[12][8]

In a paper by Fei-Fei Li et al., a different regularized loss metric and an accelerated training method were adopted to produce results in real time, three orders of magnitude faster than Gatys's method.

Training uses a loss function similar to that of the basic NST method but also regularizes the output for smoothness using a total variation (TV) loss.[13]
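
In this fast variant the per-image optimization is replaced by training a feedforward image transformation network; at test time, stylization is a single forward pass. A rough sketch of one training step under the same assumptions as above (transform_net and all hyperparameters are hypothetical):

    import torch

    def train_step(transform_net, optimizer, content_img, s_targets,
                   alpha=1.0, beta=1e4, tv_weight=1e-5):
        style_layers = [0, 5, 10, 19, 28]
        optimizer.zero_grad()
        stylized = transform_net(content_img)      # single forward pass
        # Perceptual losses are still measured with the fixed, pretrained VGG.
        g_c = extract_features(stylized, [21])[21]
        c_target = extract_features(content_img, [21])[21]
        g_feats = extract_features(stylized, style_layers)
        g_s = [g_feats[i] for i in style_layers]
        loss = alpha * content_loss(g_c, c_target) \
             + beta * style_loss(g_s, s_targets, [0.2] * 5) \
             + tv_weight * total_variation_loss(stylized)
        loss.backward()       # here the gradient updates the network's
        optimizer.step()      # weights, not the pixels of an image
        return loss.item()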

In a work by Chen Dongdong et al., the fusion of optical flow information into feedforward networks was explored in order to improve the temporal coherence of the output.[14]

Most recently, feature-transform-based NST methods have been explored for fast stylization that is not coupled to a single specific style and enables user-controllable blending of styles, for example the whitening and coloring transform (WCT).

Figure: solid lines show the direction of forward propagation of data; dotted lines show the backward propagation of the loss gradient.[7]