CLIP: Teaching AI to Connect Images and Language

Author: Jared Chung

Imagine an AI that can look at any image and understand it well enough to answer questions about it in natural language - without ever being explicitly trained on those specific tasks. This is exactly what CLIP (Contrastive Language-Image Pre-training) achieved, fundamentally changing how we think about connecting computer vision and natural language processing.

CLIP represents a paradigm shift from narrow, task-specific models to general-purpose vision-language understanding. Instead of training separate models for image classification, object detection, and visual question answering, CLIP learns a unified representation that works across all these tasks through its revolutionary approach to multimodal learning.

The Revolutionary Breakthrough: Learning Vision Through Language

The Problem CLIP Solved

Traditional computer vision models learned through a narrow lens:

  • Image classifiers: Could only recognize a fixed set of categories (e.g., the 1,000 ImageNet classes)
  • Object detectors: Required expensive bounding box annotations
  • Specialized models: Needed separate training for each new task

The limitation: Every new task required collecting labeled data and training a new model from scratch.

CLIP's Breakthrough Insight

The Key Idea: What if we could learn vision by reading about it?

Instead of learning "this image contains a dog" from labeled datasets, CLIP learns by reading captions like "a golden retriever playing in the park" paired with corresponding images. This approach leverages the vast amount of naturally occurring image-text pairs on the internet.

The Scale: CLIP was trained on 400 million image-text pairs scraped from the web - orders of magnitude more diverse than any traditional dataset.

[Figure: CLIP architecture]

The CLIP Architecture: Two Encoders, One Shared Understanding

CLIP's elegance lies in its simplicity - just two main components working in harmony:

1. Vision Encoder (Image Understanding)

  • Typically based on the Vision Transformer (ViT) architecture (OpenAI also released ResNet-based variants)
  • Processes images by dividing them into patches
  • Outputs a rich vector representation of visual content
  • Learns to encode everything from objects to scenes to artistic styles

2. Text Encoder (Language Understanding)

  • Based on Transformer architecture (similar to GPT)
  • Processes natural language descriptions
  • Outputs semantic vector representations of text meaning
  • Understands context, relationships, and nuanced descriptions

The Magic Connection: Both encoders output vectors in the same dimensional space, allowing direct comparison between images and text. When an image and its description are processed, their vectors should be very similar in this shared "concept space."
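
To make this concrete, here is a minimal sketch that embeds one image and one caption and measures their cosine similarity in the shared space. It assumes the Hugging Face transformers implementation of CLIP; the checkpoint name and image path are placeholders.

# Sketch: project an image and a caption into CLIP's shared space and
# measure how well they align (Hugging Face transformers API).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("golden_retriever.jpg")  # placeholder image path
caption = "a golden retriever playing in the park"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_vec = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_vec = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize, then take the dot product = cosine similarity in the shared space
image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)
print(f"cosine similarity: {(image_vec @ text_vec.T).item():.3f}")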

How CLIP Learns: Contrastive Training

The Training Process:

  1. Gather Pairs: Collect millions of (image, caption) pairs from the internet
  2. Encode Both: Pass image through vision encoder, caption through text encoder
  3. Compare: Measure similarity between image and text vectors
  4. Learn: Adjust both encoders so matching pairs are more similar, non-matching pairs less similar

The Contrastive Insight: In each training batch, CLIP sees one correct image-text pair and many incorrect pairings. It learns to make the correct pair stand out from the crowd - this is what gives CLIP its discriminative power.
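
A minimal sketch of this objective, assuming a batch of already-computed, L2-normalized image and text embeddings (the function name, batch size, and temperature value are illustrative):

# Sketch of CLIP's symmetric contrastive loss for one training batch.
# Row i of image_embeds and text_embeds come from the same (image, caption) pair.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Similarity matrix: entry (i, j) compares image i with caption j
    logits = image_embeds @ text_embeds.T / temperature

    # The "correct" caption for image i sits on the diagonal (column i)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings: batch of 8 pairs, 512-dimensional, unit length
image_embeds = F.normalize(torch.randn(8, 512), dim=-1)
text_embeds = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(image_embeds, text_embeds))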

Zero-Shot Learning: CLIP's Superpower

What Makes CLIP Revolutionary

Zero-Shot Classification: CLIP can classify images into categories it has never explicitly been trained on. Show it an image and ask "Is this a bicycle or a motorcycle?" and it can answer correctly without ever being trained on this specific classification task.

How Zero-Shot Works:

  1. Convert the task to text: Instead of predefined classes, use natural language descriptions

    • Traditional: Class 0, Class 1, Class 2...
    • CLIP: "a photo of a cat", "a photo of a dog", "a photo of a bird"
  2. Compare in shared space: Encode both the image and all possible text descriptions

  3. Find the best match: The text description most similar to the image wins

Practical CLIP Applications

Image Classification Without Training Data:

# Zero-shot classification with natural-language labels
# (runnable sketch using the Hugging Face transformers CLIP API;
#  "your_image.jpg" is a placeholder path)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("your_image.jpg")
possible_labels = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a bird",
    "a photo of a car",
]

# Encode the image and every candidate description, then compare them
inputs = processor(text=possible_labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)

best_match = possible_labels[probs.argmax().item()]
print(f"This image shows: {best_match}")

The Power of Natural Language: Instead of being limited to predefined categories, you can classify with any description:

  • "a happy dog playing in snow"
  • "a vintage red sports car from the 1960s"
  • "a person wearing a blue business suit"
  • "abstract art with vibrant colors"

Real-World CLIP Capabilities

1. Visual Search (see the code sketch after this list):

  • Search through image collections using natural language
  • "Find photos of sunsets over mountains"
  • "Show me images with people laughing"

2. Content Moderation:

  • Automatically detect inappropriate content
  • Classify images without predefined categories
  • Understand context and nuance

3. Creative Applications:

  • Generate alt-text for images
  • Power image-to-text generation models
  • Enable multimodal AI assistants

4. E-commerce and Retail:

  • Search products by description
  • Automatic product categorization
  • Visual recommendation systems
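
As a concrete sketch of the visual-search use case above, the snippet below ranks a small, hypothetical image collection against a text query with the same transformers CLIP checkpoint used earlier; the file names are placeholders.

# Sketch: rank a small image collection by how well each image matches
# a natural-language query (placeholder file names).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["sunset.jpg", "office.jpg", "beach.jpg"]  # placeholder collection
images = [Image.open(p) for p in paths]
query = "sunsets over mountains"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    image_vecs = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_vec = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the query and every image, best matches first
image_vecs = image_vecs / image_vecs.norm(dim=-1, keepdim=True)
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)
scores = (image_vecs @ text_vec.T).squeeze(-1)
for idx in scores.argsort(descending=True).tolist():
    print(paths[idx], round(scores[idx].item(), 3))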

Understanding CLIP's Vector Space

The Shared Embedding Space: Both images and text get mapped to the same high-dimensional space where:

  • Similar concepts cluster together
  • Relationships are preserved geometrically
  • Distance tracks semantic similarity: closer vectors mean more related concepts

Example Relationships:

  • "cat" and "kitten" vectors are close together
  • "dog" and "puppy" vectors are close together
  • "cat" and "dog" are closer than "cat" and "car"
  • An image of a cat is very close to the text "a photo of a cat"
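
One quick way to see these relationships is to embed a few phrases and compare them directly. In the sketch below (again assuming the transformers CLIP checkpoint used earlier), we would expect "cat" and "kitten" to score noticeably higher than "cat" and "car":

# Sketch: compare text embeddings to see which concepts cluster together.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a cat", "a photo of a kitten", "a photo of a car"]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    vecs = model.get_text_features(**inputs)
vecs = vecs / vecs.norm(dim=-1, keepdim=True)

sims = vecs @ vecs.T
print("cat vs kitten:", round(sims[0, 1].item(), 3))
print("cat vs car:   ", round(sims[0, 2].item(), 3))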

The Magic of Similarity Scoring

How CLIP Compares Images and Text:

  1. Encode: Transform image and text into numerical vectors
  2. Normalize: Ensure all vectors have the same magnitude
  3. Dot Product: Multiply vectors element-wise and sum (measures alignment)
  4. Score: Higher scores mean more similar concepts
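
As a worked example of steps 2-4, here is the arithmetic on tiny made-up vectors (the numbers are purely illustrative):

# Worked example: normalize -> dot product -> score, on made-up 3-d vectors.
import numpy as np

image_vec = np.array([0.8, 0.1, 0.6])   # stand-in "image embedding"
text_vecs = np.array([
    [0.7, 0.2, 0.7],                    # "a photo of a cat" (close to the image)
    [0.1, 0.9, 0.1],                    # "a photo of a car" (far from the image)
])

# Step 2: normalize every vector to unit length
image_vec = image_vec / np.linalg.norm(image_vec)
text_vecs = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)

# Step 3: dot product = cosine similarity (alignment in the shared space)
scores = text_vecs @ image_vec
print(scores)                           # roughly [0.985, 0.251]

# Optional: scale (CLIP learns such a temperature) and softmax into probabilities
probs = np.exp(scores * 100) / np.exp(scores * 100).sum()
print(probs)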

What Similarity Scores Mean: Raw cosine similarities from CLIP are modest in absolute terms (often roughly 0.1-0.4 even for good matches), so scores are best read relatively: compare several candidate descriptions, and convert the scaled similarities into probabilities with a softmax when you need an absolute-feeling number. For such normalized scores, a rough guide is:

  • 0.9+: Very strong match (the image clearly shows the described concept)
  • 0.7-0.9: Good match (the concept is present and recognizable)
  • 0.5-0.7: Moderate match (some relevance, but not the primary subject)
  • 0.3-0.5: Weak match (minimal relevance)
  • Under 0.3: Poor match (the concepts are unrelated)

The CLIP Ecosystem: Beyond OpenAI

Open Source Innovation: OpenCLIP by LAION

While OpenAI created the original CLIP, the open-source community has built upon this foundation to create even more powerful models.

LAION's Contribution:

  • Massive Scale: Released datasets with 5+ billion image-text pairs
  • Multilingual Support: Training data in dozens of languages
  • Open Access: All models and data freely available for research
  • Superior Performance: Many OpenCLIP models outperform the original

The Big-G Breakthrough: LAION's ViT-G/14 "Big G" model represents a significant advancement:

  • Training Scale: 2+ billion high-quality image-text pairs
  • Zero-Shot Performance: 80.3% accuracy on ImageNet (vs ~68% for original CLIP)
  • Model Size: 1+ billion parameters for richer representations
  • Multilingual Capabilities: Works across multiple languages

[Figure: LAION OpenCLIP ViT-G/14]

CLIP Model Variants and Trade-offs

Size vs Performance:

  • CLIP-ViT-B/32: Fast inference, good for real-time applications
  • CLIP-ViT-B/16: Balanced performance and speed
  • CLIP-ViT-L/14: High accuracy, slower inference
  • OpenCLIP-ViT-G/14: Best performance, resource intensive

Choosing the Right Model:

  • Real-time applications: Use smaller, faster models
  • High-accuracy tasks: Use larger, more powerful models
  • Resource constraints: Consider inference time and memory requirements
  • Multilingual needs: OpenCLIP models often have better non-English support
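
Loading a specific variant is straightforward with the open_clip library. A minimal sketch is shown below; the model name and pretrained tag are examples and may differ between open_clip releases, and the image path is a placeholder.

# Sketch: zero-shot classification with an OpenCLIP variant.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("your_image.jpg")).unsqueeze(0)  # placeholder path
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)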

CLIP's Impact and Future

Transforming Computer Vision

Before CLIP:

  • Separate models for each vision task
  • Expensive annotation requirements
  • Limited to predefined categories
  • Poor generalization to new domains

After CLIP:

  • Unified vision-language understanding
  • Zero-shot capabilities across tasks
  • Natural language as the interface
  • Strong transfer learning abilities

Real-World Adoption

Industry Applications:

  • Search Engines: Visual search using natural language
  • Social Media: Automated content moderation and tagging
  • E-commerce: Product discovery and recommendation
  • Creative Tools: AI-powered design and content creation

Research Breakthroughs:

  • DALL-E 2: Uses CLIP for image generation guidance
  • Flamingo: Few-shot learning on vision-language tasks
  • GLIDE: Text-guided diffusion models
  • GPT-4V: Multimodal large language models

The Bigger Picture

CLIP's Legacy: CLIP didn't just create a better vision model - it demonstrated that:

  • Scale matters: Large datasets enable emergent capabilities
  • Natural language is powerful: Text provides rich supervision signals
  • Multimodal learning works: Combining modalities creates synergies
  • Zero-shot transfer is possible: Models can generalize beyond training tasks

Future Directions:

  • Video understanding: Extending CLIP to temporal data
  • 3D vision: Incorporating spatial reasoning
  • Embodied AI: Connecting vision-language to robotics
  • Few-shot learning: Learning new concepts from minimal examples

Key Takeaways

Why CLIP Matters:

  1. Democratizes AI: No need for expensive labeled datasets
  2. Flexible Interface: Natural language replaces rigid categories
  3. Strong Transfer: One model works across many tasks
  4. Scalable Training: Can leverage internet-scale data

When to Use CLIP:

  • Zero-shot classification: Quick prototyping without training data
  • Visual search: Find images using text descriptions
  • Content understanding: Analyze images with nuanced queries
  • Transfer learning: Starting point for vision-language tasks

Limitations to Consider:

  • Bias: Reflects biases present in internet training data
  • Fine-grained details: May struggle with subtle visual differences
  • Specialized domains: Medical/scientific images may need domain-specific training
  • Computational cost: Larger models require significant resources

CLIP represents a fundamental shift toward more general, flexible AI systems that understand the world through the rich interplay between vision and language - a crucial step toward truly intelligent machines.
