Deep Fusion - Neural Style Transfer  

Neural Style Transfer (NST) is a fascinating technique in computer vision and machine learning that merges two images into one. The process takes the content of one image and combines it with the style of another, usually producing striking, artistically compelling results. The technique leverages deep neural networks, and one of the most popular networks for this purpose is VGG-19. Developed by the Visual Geometry Group at Oxford (hence "VGG"), the VGG-19 model is renowned for its effectiveness in image recognition tasks.

The core principle of NST with VGG-19 lies in understanding and separating the 'content' and 'style' features of an image. Content features are generally associated with the higher-level shapes and objects in an image, while style features describe its textures and visual patterns. VGG-19, with its deep 19-layer architecture, is adept at extracting these intricate features: as an input image passes through its many layers, each layer captures a different level of abstraction and complexity.

The NST process starts by feeding both the content image (the image whose content you want to keep) and the style image (the image whose style you want to apply) into the VGG-19 network. The network processes the two images separately and extracts their respective features. The next step recombines these features by optimizing a new image to have the content features of the content image and the style features of the style image. The optimization is carried out with backpropagation: the new image is iteratively updated to minimize a loss function that measures how much its content and style differ from those of the original images.

To illustrate, imagine using a photograph of a cityscape as the content image and a famous painting like Van Gogh's "Starry Night" as the style image. The resulting image would keep the recognizable structures of the cityscape, but with the swirling, dream-like textures and color palette of "Starry Night". This demonstrates the unique capability of NST with VGG-19: creating novel, visually engaging images that blur the boundary between photography and traditional art forms. Let's generate a couple of illustrative images to showcase this concept.


The Data

Content images are photographs I took; style images are paintings. Each content photo is paired with one painting:

Michigan Ave, Chicago, IL with Henri Rousseau's 66th painting
Ouray, Colorado with Claude Monet's 42nd painting
Statue of Liberty, New York with William Turner's 42nd painting

The Approach

I'm using the VGG19 model in my style transfer project. To capture the essence of the photos I'm working with, I've chosen specific layers from the network. For the content representation I selected the 'block5_conv2' layer, because it sits deep in the network and therefore captures high-level content properties. To extract a rich representation of style at multiple scales, I use the first convolutional layer of each of the five blocks, from 'block1_conv1' to 'block5_conv1'. From the lengths of the corresponding lists I create two variables, number_content and number_style, to keep track of how many layers are used for each aspect; these counts come into play later when the content and style loss terms are defined and weighted.
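
In code, that layer selection might look like the following TensorFlow/Keras sketch (the feature_extractor name and the exact model-building details are illustrative, not taken verbatim from the project):

```python
import tensorflow as tf

# Content representation: one deep layer that captures high-level structure
content_layers = ['block5_conv2']

# Style representation: the first convolution of each block, for multi-scale texture
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

number_content = len(content_layers)
number_style = len(style_layers)

# VGG19 pretrained on ImageNet, classification head removed, weights frozen
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
vgg.trainable = False

# A model that maps an input image to the activations of the chosen layers
outputs = [vgg.get_layer(name).output for name in style_layers + content_layers]
feature_extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)
```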

Define content loss

Content loss is a crucial component in the process of neural style transfer, acting as a measure of how much the feature map of a generated image differs from the feature map of the content image. In the context of a Convolutional Neural Network (CNN) like VGG19, when an image is passed through the network, various layers capture different aspects of the image's content, such as edges, textures, and high-level features. By selecting a particular layer (or layers) to define the content representation, we can extract feature maps that represent the content of both the content image and the generated image at that layer's level of abstraction.

By minimizing the content loss during the training process, the generated image is adjusted such that its content representations become increasingly similar to the content image's representations. This is crucial for ensuring that, although the style of the image may change to match the style image, the core content—the arrangement of objects and their features—remains consistent with the original content image. The goal is to preserve the essence and structure of the content image while imbuing it with the artistic style of another image, achieving a balance where the content is recognizable, but the overall appearance is transformed.
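
A minimal sketch of this loss, continuing with the TensorFlow setup above and assuming the feature maps come from the chosen content layer:

```python
def content_loss(content_features, generated_features):
    # Mean squared difference between the content image's feature map and
    # the generated image's feature map at the chosen content layer
    return tf.reduce_mean(tf.square(generated_features - content_features))
```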


Define style loss

The goal is to compute a style matrix for both the generated image and the style image; the style loss is then defined as the mean squared difference between the two style matrices. Style information is measured as the amount of correlation between the feature maps in a given layer, so the loss penalizes differences between the correlations found in the generated image and those found in the style image. The Gram matrix is used to measure this correlation between the feature maps of a convolutional layer.


This correlation is quantified using a construct known as the Gram matrix. For a given layer in the CNN, the Gram matrix is computed by flattening each feature map into a vector and taking the inner product between every pair of these vectors (equivalently, reshaping the layer's activations into a matrix and multiplying it by its own transpose). The resulting matrix captures the correlation of activations across the different feature maps in the layer. The intuition behind using the Gram matrix is that features that activate together in the style image should also activate together in the generated image. The Gram matrix is calculated for both the generated image and the style reference image, yielding two separate matrices that represent the underlying style features of each.
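
One common way to implement the Gram matrix and the resulting style loss, sketched here in TensorFlow and assuming channels-last activations of shape (batch, height, width, channels); the normalization constant may differ from the project's actual code:

```python
def gram_matrix(feature_maps):
    # Sum the outer products of the channel vectors at every spatial position,
    # then normalize by the number of positions
    result = tf.linalg.einsum('bijc,bijd->bcd', feature_maps, feature_maps)
    shape = tf.shape(feature_maps)
    num_locations = tf.cast(shape[1] * shape[2], tf.float32)
    return result / num_locations

def style_loss(style_features, generated_features):
    # Mean squared difference between the two Gram (style) matrices
    return tf.reduce_mean(
        tf.square(gram_matrix(generated_features) - gram_matrix(style_features)))
```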

Training Example
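
To show how the pieces fit together, here is a hedged sketch of a single optimization step that reuses the feature_extractor, content_loss, and style_loss sketches above; the loss weights, optimizer settings, and the assumption that the target activations were precomputed from the content and style images are illustrative, not the project's actual values:

```python
content_weight = 1e4   # illustrative weighting between content and style terms
style_weight = 1e-2

optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)

@tf.function
def train_step(generated_image, content_targets, style_targets):
    # generated_image: tf.Variable of shape (1, H, W, 3) with values in [0, 1]
    # content_targets / style_targets: lists of target activations, precomputed
    # by running the content and style images through feature_extractor
    with tf.GradientTape() as tape:
        preprocessed = tf.keras.applications.vgg19.preprocess_input(generated_image * 255.0)
        features = feature_extractor(preprocessed)
        style_feats = features[:number_style]
        content_feats = features[number_style:]

        c_loss = tf.add_n([content_loss(t, f)
                           for t, f in zip(content_targets, content_feats)])
        s_loss = tf.add_n([style_loss(t, f)
                           for t, f in zip(style_targets, style_feats)])
        total_loss = content_weight * c_loss + style_weight * s_loss

    # Backpropagate into the pixels of the generated image itself
    grads = tape.gradient(total_loss, generated_image)
    optimizer.apply_gradients([(grads, generated_image)])
    generated_image.assign(tf.clip_by_value(generated_image, 0.0, 1.0))
    return total_loss
```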

Observation of Style Transfer over Epochs

Style-Transfer-Training.mov

Final Output