Artwork Description Generation 

 The Architecture 

ViT-GPT2-Image-Captioning Model


The architecture depicted in the image is a schematic representation of a Vision Transformer (ViT) model, an adaptation of the transformer architecture, originally designed for natural language processing, to computer vision. The diagram shows how an input image is converted into a series of patches, which are then processed by a transformer model.

In the first stage of the architecture, an input image is divided into fixed-size patches. These patches are flattened and linearly projected into a sequence of lower-dimensional embeddings. An additional learnable embedding, often referred to as the "class" token, is prepended to this sequence. Positional embeddings are added to the patch embeddings to retain positional information, as transformers by themselves have no notion of order or sequence. This is crucial for the model to understand the arrangement of the patches and thus the spatial structure of the image.
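
To make the patching step concrete, below is a minimal PyTorch sketch of patch embedding, assuming ViT-Base-like dimensions (224x224 RGB input, 16x16 patches, 768-dimensional embeddings); the layer and variable names are illustrative and not taken from the model's source.

```python
import torch
import torch.nn as nn

# Minimal sketch of ViT-style patch embedding (dimensions are assumptions).
image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2            # 14 * 14 = 196 patches

# Patchify + linear projection in one step via a strided convolution.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

# Learnable [class] token and positional embeddings (one per patch plus the class token).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

x = torch.randn(1, 3, image_size, image_size)            # dummy image batch
patches = to_patches(x).flatten(2).transpose(1, 2)        # (1, 196, 768)
tokens = torch.cat([cls_token, patches], dim=1) + pos_embed   # (1, 197, 768)
```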

The sequence of patch embeddings then enters the transformer encoder, which consists of a stack of identical layers. Each layer has two main components: multi-head self-attention and a simple feedforward neural network. Self-attention allows the model to weigh the importance of different patches relative to one another, which is essential for understanding the image contextually. The feedforward network applies further transformations to the data.
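
The following is a minimal sketch of a single encoder layer along these lines, with multi-head self-attention and a feed-forward network wrapped in residual connections and layer normalization; the dimensions are illustrative assumptions, not the exact configuration of the checkpoint.

```python
import torch
import torch.nn as nn

# One transformer encoder layer: self-attention + feed-forward (sizes are assumptions).
embed_dim, num_heads, ff_dim = 768, 12, 3072

norm1, norm2 = nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))

tokens = torch.randn(1, 197, embed_dim)                   # patch + class tokens
h = norm1(tokens)
attn_out, _ = attn(h, h, h)                               # every patch attends to every other patch
tokens = tokens + attn_out                                # residual connection
tokens = tokens + ffn(norm2(tokens))                      # position-wise feed-forward
```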

The final stage involves the transformer decoder, which typically would be used for generating an output sequence in tasks like image captioning or object detection. It includes components like masked self-attention, which prevents the model from seeing future tokens in a sequence, and encoder-decoder attention, which allows the decoder to focus on different parts of the input sequence. The feedforward neural network here further processes the data, and the output embeddings are then used to generate predictions or textual descriptions, as indicated by the "Text-caption" label at the bottom of the diagram. This architecture is highly flexible and has been groundbreaking for various applications in computer vision due to its ability to model complex dependencies and learn global representations of the input data.
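
As an illustration of how the ViT encoder and GPT-2 decoder come together in practice, the sketch below loads the nlpconnect/vit-gpt2-image-captioning checkpoint directly and generates a caption; the image path and generation parameters are placeholders.

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_id = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder image path; any RGB artwork image works here.
image = Image.open("example_artwork.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The ViT encoder embeds the patches; the GPT-2 decoder turns them into a caption.
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```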

 The Data 

Fine-Tuning a Hugging Face Transformer Model 


Data Template for Fine Tuning

From the 51 artists, we shortlist the top 11, from which we select 3 artists for fine-tuning. The model itself is computationally heavy, hence the limit on the number of classes we choose.

Since we already have the images, the genre, and the path to each image, the only thing missing is the descriptions, which are unknown. We use the transformer pipeline inference of the same model to mine the descriptions for the images iteratively.
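
A possible shape for the resulting data template is sketched below; the field names, paths, and placeholder values are assumptions chosen to mirror the columns mentioned above (image, genre, path, description), not the exact schema used.

```python
# Sketch of one fine-tuning record per image; names and paths are placeholders.
data_template = [
    {
        "artist": "Artist_Name",
        "genre": "Genre_Name",
        "image_path": "images/Artist_Name/painting_001.jpg",
        "description": None,   # unknown for now; mined with the captioning pipeline
    },
    # ... one record per image for the 3 selected artists
]
```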

Mining Descriptions

This code snippet utilizes the Hugging Face Transformers library to create a high-level helper around a pre-trained image-captioning model. The code imports the necessary modules, including the pipeline function from the Transformers library and the filterwarnings function to suppress warnings. It then initializes an image-to-text pipeline using a pre-trained model called "nlpconnect/vit-gpt2-image-captioning". This pipeline takes images as input and generates textual descriptions of them.

The code then iterates through a dictionary called IMAGES_LIST, which presumably contains artists' names as keys and lists of image filenames as values. For each artist, it processes their images one by one. It constructs the file path for each image using the artist's name and the image filename and passes the image through the pipeline. The generated textual description for each image is stored in the IMG_DESC dictionary, with the image filename as the key.
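
A minimal sketch of this mining loop is shown below. It reuses the IMAGES_LIST and IMG_DESC names from the description above, but the directory layout and placeholder entries are assumptions.

```python
import os
from warnings import filterwarnings
from transformers import pipeline

filterwarnings("ignore")  # suppress library warnings during inference

# Image-to-text pipeline backed by the pre-trained captioning checkpoint.
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# IMAGES_LIST maps artist name -> list of image filenames; the layout
# "images/<artist>/<filename>" and the entries below are placeholders.
IMAGES_LIST = {
    "Artist_Name": ["painting_001.jpg", "painting_002.jpg"],
}

IMG_DESC = {}
for artist, filenames in IMAGES_LIST.items():
    for filename in filenames:
        image_path = os.path.join("images", artist, filename)
        result = image_to_text(image_path)                 # [{"generated_text": "..."}]
        IMG_DESC[filename] = result[0]["generated_text"]
```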


→ Descriptions  

Final List of Dictionaries  

Training

Model in Action

Note: The saved model state was corrupted, and the training requires a substantial amount of resources and time. Hence, the model was not retrained with the data at hand. The application demonstrates the use case with the pre-trained Hugging Face pipeline, which works better out of the box than many other models.