Object Detection and Image Segmentation: A Breakdown of the Main Methods

In this post we look at object detection and its close relatives, with code examples for each. Object detection, semantic segmentation, instance segmentation, and panoptic segmentation are all computer vision techniques for identifying and labeling objects in an image. They differ mainly in the granularity of their output: bounding boxes, per-pixel class labels, per-object masks, or a combination of the last two. We will discuss each technique in turn and provide code examples built on PyTorch, a popular deep learning framework, and its companion computer vision library, TorchVision.

Object Detection

Object detection is the process of detecting and localizing objects within an image. It involves identifying which objects are present and drawing a bounding box around each one to indicate its location, usually together with a class label and a confidence score. Object detection is typically done using deep learning models such as YOLO (You Only Look Once) and Faster R-CNN (a region-based convolutional neural network).
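Bounding boxes are usually compared with intersection over union (IoU), which measures how much two boxes overlap. The following is a minimal sketch in plain Python, assuming boxes in (x1, y1, x2, y2) format as returned by the TorchVision detection models; the function name `box_iou` is chosen here for illustration (TorchVision offers a batched version as `torchvision.ops.box_iou`).

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: IoU = 1 / (4 + 4 - 1)
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU of 1 means the boxes are identical and 0 means they do not overlap at all.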

To perform object detection using PyTorch, we can use the TorchVision library, which provides pre-trained models for object detection. The following code demonstrates how to perform object detection using the Faster R-CNN model:


import torch
import torchvision

from PIL import Image, ImageDraw
import requests
from io import BytesIO

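# Load the Faster R-CNN model with pre-trained COCO weights.
# `weights="DEFAULT"` replaces the deprecated `pretrained=True` (torchvision >= 0.13).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")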

# Load the input image
url = "http://farm1.staticflickr.com/28/54397364_f60be34ce1_z.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")  # the model expects 3 channels

# Convert the image to a PyTorch tensor
input_tensor = torchvision.transforms.functional.to_tensor(image)

# Add a batch dimension to the tensor
input_tensor = input_tensor.unsqueeze(0)

# Run the input tensor through the model
model.eval()
with torch.no_grad():
    output = model(input_tensor)

# Get the predicted boxes, labels, and scores for each object in the image
boxes = output[0]['boxes']
labels = output[0]['labels']
scores = output[0]['scores']

# Draw a box and the numeric COCO class ID for every detection scoring above 0.5
draw = ImageDraw.Draw(image)
for box, label, score in zip(boxes, labels, scores):
    if score > 0.5:
        # Convert the tensor to plain floats before handing it to PIL
        x1, y1, x2, y2 = box.tolist()
        draw.rectangle([x1, y1, x2, y2], outline=(255, 0, 0), width=2)
        draw.text((x1, y1), str(label.item()), fill=(255, 0, 0))

# Save the output image
image.save("output_image.jpg")

In this code, we first load the pre-trained Faster R-CNN model from TorchVision. We then load the input image from the URL (browse https://cocodataset.org/#explore for other examples) and convert it to a PyTorch tensor. We add a batch dimension to the tensor and run it through the model. The output of the model is a list with one dictionary per input image, and each dictionary holds the boxes, labels, and scores of the objects detected in that image. We extract these values, draw a bounding box and the numeric class ID for every detection with a score above 0.5, and save the output image.
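The labels returned by the model are integer class IDs, so a lookup table is needed to make them readable. Below is a minimal sketch in plain Python of filtering detections by score and attaching names. The `COCO_NAMES` dictionary is a small illustrative subset, and `summarize_detections` is a hypothetical helper, not part of TorchVision. With real model output, the tensors would first be converted with `.tolist()`.

```python
# Illustrative subset of COCO category IDs; a full table is needed in practice
COCO_NAMES = {1: "person", 2: "bicycle", 3: "car", 18: "dog"}


def summarize_detections(boxes, labels, scores, threshold=0.5):
    """Keep detections scoring above `threshold` and attach a readable name."""
    results = []
    for box, label, score in zip(boxes, labels, scores):
        if score > threshold:
            name = COCO_NAMES.get(label, f"class_{label}")
            results.append((name, round(score, 2), box))
    return results


# Made-up values standing in for output[0]['boxes'].tolist() and friends
detections = summarize_detections(
    boxes=[[10, 20, 110, 220], [50, 60, 90, 100], [0, 0, 30, 30]],
    labels=[1, 18, 3],
    scores=[0.98, 0.72, 0.31],
)
for name, score, box in detections:
    print(name, score, box)
```

Raising the threshold trades missed objects for fewer false positives, so the value of 0.5 is a starting point rather than a rule.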

Semantic Segmentation

Semantic segmentation is the process of assigning a class label to every pixel in an image, based on the object category the pixel belongs to. The result divides the image into regions that share a label, but it does not distinguish between separate objects of the same class: two adjacent cars form a single "car" region. Semantic segmentation is commonly used in applications such as autonomous driving, where it is important to accurately identify the objects in the environment.
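To make the idea of a per-pixel label map concrete, here is a tiny hand-written example in plain Python. The class IDs follow the Pascal VOC convention (0 = background, 7 = car, 15 = person); the grid itself is made up for illustration.

```python
from collections import Counter

# A 4x6 label map: every entry is the class ID of one pixel
label_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 7, 7, 0, 7, 7],
    [0, 7, 7, 0, 7, 7],
    [15, 15, 0, 0, 0, 0],
]

# Count how many pixels were assigned to each class
pixel_counts = Counter(class_id for row in label_map for class_id in row)
print(pixel_counts)
```

Note that the two separate car-shaped blobs both carry the label 7. Nothing in the label map says whether they are one car or two, which is exactly the gap that instance segmentation fills.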

To perform semantic segmentation using PyTorch, we can use the DeepLabV3 model from TorchVision. The following code demonstrates how to perform semantic segmentation using this model:

import torch
import torchvision

from PIL import Image, ImageDraw
import requests
from io import BytesIO

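# Load the pre-trained DeepLabV3 model (TorchVision provides DeepLabV3, not DeepLabV3+).
# `weights="DEFAULT"` replaces the deprecated `pretrained=True` (torchvision >= 0.13).
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")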

# Set the model to evaluation mode
model.eval()

# Load the input image
url = "http://farm9.staticflickr.com/8479/8212475146_43ea4f0216_z.jpg"
response = requests.get(url)
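image = Image.open(BytesIO(response.content)).convert("RGB")

# Convert the image to a PyTorch tensor and normalize it with the ImageNet
# mean and standard deviation that the pre-trained weights expect
input_tensor = torchvision.transforms.functional.to_tensor(image)
input_tensor = torchvision.transforms.functional.normalize(
    input_tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
)

# Add a batch dimension to the tensor
input_tensor = input_tensor.unsqueeze(0)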

# Run the input tensor through the model
with torch.no_grad():
    output = model(input_tensor)['out'][0]
output_predictions = output.argmax(0)

# Build a palette with one color per class (21 classes) and attach it to the label map
palette = torch.tensor([2 ** 25 - 1, 2 ** 15 - 1, 2 ** 21 - 1])
colors = torch.as_tensor([i for i in range(21)])[:, None] * palette
colors = (colors % 255).numpy().astype("uint8")

r = Image.fromarray(output_predictions.byte().cpu().numpy()).resize(image.size)
r.putpalette(colors)

r.save("output_image.png")

In this code, we first load the pre-trained DeepLabV3 model from TorchVision. We set the model to evaluation mode and load the input image. We convert the image to a PyTorch tensor, add a batch dimension to the tensor, and run it through the model. The output of the model is a dictionary whose "out" key holds a score for every class at every pixel. Taking the argmax over the class dimension gives the predicted segmentation mask, in which each pixel stores a class ID. Finally, we attach a color palette to this mask so that each class is shown in its own color and save the result as a PNG. Note that the code saves the colored label map on its own and does not overlay it on the input image.
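The step from per-class scores to a label map is a per-pixel argmax. The sketch below reproduces it in plain Python on a made-up 2x2 image with three classes, assuming scores are indexed as `scores[class][row][column]`; `argmax_over_classes` is a name chosen for illustration.

```python
def argmax_over_classes(scores):
    """Return a label map holding, for each pixel, the index of its best class."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Made-up scores for 3 classes on a 2x2 image
scores = [
    [[2.0, 0.1], [0.3, 0.2]],  # class 0
    [[1.0, 3.0], [0.1, 0.9]],  # class 1
    [[0.5, 0.2], [4.0, 0.4]],  # class 2
]

label_map = argmax_over_classes(scores)
print(label_map)
```

On tensors the same operation is a single call to `argmax` along the class dimension, which is far faster than looping over pixels.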

Instance Segmentation

Instance segmentation is the process of detecting and segmenting each individual object within an image. It involves identifying which objects are present and predicting a pixel-level mask for each one, so that two objects of the same class receive separate masks. Instance segmentation is commonly used in applications such as medical imaging and robotics.
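One common way to view instance masks together is to paint them into a single instance map in which each pixel stores the index of the object that owns it. The following is a minimal sketch in plain Python, assuming binary masks of equal size and one confidence score per mask; `build_instance_map` is a hypothetical helper, and letting the higher-scoring mask win on overlapping pixels is one possible design choice among several.

```python
def build_instance_map(masks, scores, height, width):
    """Paint binary masks into one map; 0 is background, i + 1 marks mask i."""
    instance_map = [[0] * width for _ in range(height)]
    # Paint low-scoring masks first so that high-scoring ones overwrite them
    order = sorted(range(len(masks)), key=lambda i: scores[i])
    for i in order:
        for y in range(height):
            for x in range(width):
                if masks[i][y][x]:
                    instance_map[y][x] = i + 1
    return instance_map


# Two made-up masks that overlap in the middle column
mask_a = [[1, 1, 0], [1, 1, 0]]
mask_b = [[0, 1, 1], [0, 1, 1]]

instance_map = build_instance_map([mask_a, mask_b], scores=[0.9, 0.6], height=2, width=3)
print(instance_map)
```

Unlike a semantic label map, the values here identify objects rather than classes, so two objects of the same class get different numbers.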

To perform instance segmentation using PyTorch, we can use the Mask R-CNN model from TorchVision. The following code demonstrates how to perform instance segmentation using this model:

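import torch
import torchvision

import numpy as np
from PIL import Image, ImageDraw
import requests
from io import BytesIO

# Load the Mask R-CNN model with pre-trained COCO weights.
# `weights="DEFAULT"` replaces the deprecated `pretrained=True` (torchvision >= 0.13).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")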

# Load the input image
url = "http://farm3.staticflickr.com/2828/9397362228_dd3a4989bc_z.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")  # the model expects 3 channels

# Convert the image to a PyTorch tensor
input_tensor = torchvision.transforms.functional.to_tensor(image)

# Add a batch dimension to the tensor
input_tensor = input_tensor.unsqueeze(0)

# Run the input tensor through the model
model.eval()
with torch.no_grad():
    output = model(input_tensor)

# Get the predicted boxes, labels, scores, and masks for each object in the image
boxes = output[0]['boxes']
labels = output[0]['labels']
scores = output[0]['scores']
masks = output[0]['masks']

# Draw the bounding boxes and segmentation masks on the image
draw = ImageDraw.Draw(image)
for box, score, mask in zip(boxes, scores, masks):
    if score > 0.5:
        # Each mask has shape [1, H, W] with per-pixel probabilities; binarize at 0.5
        binary_mask = (mask[0] > 0.5).cpu().numpy()
        color = tuple(int(c) for c in np.random.randint(0, 256, size=3))
        draw.rectangle(box.tolist(), outline=color, width=2)
        # The mask covers the full image, so it is pasted at the origin, not at the box corner
        mask_image = Image.fromarray(binary_mask.astype('uint8') * 255)
        draw.bitmap((0, 0), mask_image, fill=color)

# Save the output image
image.save("output_image.jpg")

In this code, we first load the pre-trained Mask R-CNN model from TorchVision. We then load the input image and convert it to a PyTorch tensor. We add a batch dimension to the tensor and run it through the model. The output of the model is a list with one dictionary per input image, and each dictionary holds the boxes, labels, scores, and masks of the objects detected in that image. Each mask is a full-image map of per-pixel probabilities, which we threshold at 0.5 and paint onto the input image in a random color together with the bounding box. Finally, we save the output image.
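The thresholding step can be illustrated without a model. The sketch below, in plain Python, binarizes a made-up soft mask and derives the object's pixel area and tight bounding box from it; `binarize` and `mask_area_and_box` are hypothetical helper names, and the box uses inclusive pixel coordinates.

```python
def binarize(mask, threshold=0.5):
    """Turn per-pixel probabilities into a boolean mask."""
    return [[value > threshold for value in row] for row in mask]


def mask_area_and_box(binary_mask):
    """Return (area, (x_min, y_min, x_max, y_max)) of the True pixels."""
    points = [
        (x, y)
        for y, row in enumerate(binary_mask)
        for x, value in enumerate(row)
        if value
    ]
    if not points:
        return 0, None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return len(points), (min(xs), min(ys), max(xs), max(ys))


# Made-up probabilities for a 3x4 image
soft_mask = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.8, 0.9, 0.1],
    [0.1, 0.7, 0.6, 0.4],
]

area, box = mask_area_and_box(binarize(soft_mask))
print(area, box)
```

A lower threshold produces larger, more generous masks, while a higher one keeps only the pixels the model is most certain about.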

Panoptic Segmentation

Panoptic segmentation is a more recent task that combines semantic segmentation and instance segmentation. Every pixel receives a class label, and pixels that belong to countable objects ("things", such as people or cars) additionally receive an instance ID, while amorphous regions ("stuff", such as sky or road) are labeled by class only. Panoptic segmentation can be used in applications that need a complete description of a scene, such as video surveillance and urban planning.
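A panoptic result can be stored as a single map of integer IDs that encode both the class and the instance. The following is a minimal sketch in plain Python, assuming the encoding `class_id * 1000 + instance_id`, which is one convention among several. The class IDs (1 = car, 2 = road), the `LABEL_DIVISOR` constant, and the helper names are all illustrative assumptions.

```python
LABEL_DIVISOR = 1000  # assumed upper bound on instances per class


def encode(class_id, instance_id=0):
    return class_id * LABEL_DIVISOR + instance_id


def decode(panoptic_id):
    """Return (class_id, instance_id)."""
    return divmod(panoptic_id, LABEL_DIVISOR)


def merge(semantic_map, instance_map, thing_classes):
    """Combine a semantic label map and an instance map into a panoptic map."""
    panoptic = []
    for sem_row, inst_row in zip(semantic_map, instance_map):
        row = []
        for class_id, instance_id in zip(sem_row, inst_row):
            if class_id in thing_classes and instance_id > 0:
                row.append(encode(class_id, instance_id))  # countable object
            else:
                row.append(encode(class_id))  # stuff: class only
        panoptic.append(row)
    return panoptic


semantic_map = [[2, 1, 1], [2, 2, 1]]  # 1 = car (thing), 2 = road (stuff)
instance_map = [[0, 1, 2], [0, 0, 2]]  # 0 = no instance

panoptic_map = merge(semantic_map, instance_map, thing_classes={1})
print(panoptic_map)
print(decode(1002))
```

Every pixel ends up with exactly one ID, which is what distinguishes a panoptic result from a set of possibly overlapping instance masks.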

TorchVision does not ship a pre-trained panoptic segmentation model, so for this task we use Detectron2, a PyTorch-based library whose model zoo includes Panoptic FPN. Detectron2 must be installed separately. The following code demonstrates how to perform panoptic segmentation using this model:

import numpy as np
import requests
import torch
from io import BytesIO
from PIL import Image

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import Visualizer

# Load the pre-trained Panoptic FPN configuration and weights from the model zoo
config_name = "COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml"
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(config_name))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)
cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
predictor = DefaultPredictor(cfg)

# Load the input image
url = "http://farm3.staticflickr.com/2828/9397362228_dd3a4989bc_z.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
rgb = np.array(image)

# DefaultPredictor expects a BGR array by default, so reverse the channel order
bgr = np.ascontiguousarray(rgb[:, :, ::-1])

# Run the image through the model. The result is a map of segment IDs
# plus a list describing each segment (class, thing or stuff, and so on).
panoptic_seg, segments_info = predictor(bgr)["panoptic_seg"]

# Overlay the predicted segments on the image
metadata = MetadataCatalog.get(cfg.DATASETS.TRAIN[0])
visualizer = Visualizer(rgb, metadata)
result = visualizer.draw_panoptic_seg_predictions(panoptic_seg.to("cpu"), segments_info)

# Save the output image
Image.fromarray(result.get_image()).save("output_image.jpg")

In this code, we first load the Panoptic FPN configuration and its pre-trained weights from the Detectron2 model zoo and build a predictor from them. We load the input image and convert it to a NumPy array in BGR channel order, which is what the predictor expects by default. The predictor handles resizing, tensor conversion, and batching internally. Its "panoptic_seg" output is a pair: a tensor that stores a segment ID for every pixel, and a list of dictionaries describing each segment. We pass both to the Detectron2 visualizer, which colors each segment and overlays it on the input image. Finally, we save the output image.
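Panoptic output is often represented as a map of segment IDs plus one record per segment. The sketch below, in plain Python, shows how the two fit together by computing the pixel area of each segment. The field names `id`, `category_id`, and `isthing` are assumptions made for this example, as are the values, and `segment_areas` is a hypothetical helper.

```python
from collections import Counter


def segment_areas(panoptic_map, segments_info):
    """Attach the pixel count of each segment to its record."""
    counts = Counter(segment_id for row in panoptic_map for segment_id in row)
    return [{**segment, "area": counts.get(segment["id"], 0)} for segment in segments_info]


# Made-up 3x3 map of segment IDs; 0 means "no segment"
panoptic_map = [
    [1, 1, 2],
    [3, 3, 2],
    [3, 3, 0],
]

# Made-up segment records: two things and one stuff region
segments_info = [
    {"id": 1, "category_id": 0, "isthing": True},
    {"id": 2, "category_id": 0, "isthing": True},
    {"id": 3, "category_id": 21, "isthing": False},
]

for segment in segment_areas(panoptic_map, segments_info):
    print(segment)
```

Segments 1 and 2 share a category but have different IDs, which is how separate instances of the same class are told apart, while the stuff region is a single segment for its class.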

Conclusion

In conclusion, object detection, semantic segmentation, instance segmentation, and panoptic segmentation are important tasks in computer vision. These tasks enable us to build intelligent systems that can recognize and understand objects in images and videos. With the help of deep learning frameworks such as PyTorch and pre-trained models from TorchVision, we can perform these tasks with relative ease. By understanding these concepts and implementing them in code, we can build powerful computer vision systems that can perform a wide range of tasks, from autonomous driving to medical imaging.

Appendix

YOLO - https://arxiv.org/abs/1506.02640
Faster R-CNN - https://arxiv.org/abs/1506.01497
DeepLabV3 - https://arxiv.org/abs/1706.05587
Panoptic Feature Pyramid Networks - https://arxiv.org/pdf/1901.02446v2.pdf