Global Image Descriptors: From HOG to Deep Learning Features
Author: Jared Chung
Global image descriptors are fundamental to computer vision, providing compact yet informative representations of entire images. These descriptors encode important visual characteristics and have been crucial for tasks like image retrieval, classification, and similarity matching.
In this comprehensive guide, we'll explore the evolution from traditional hand-crafted descriptors to modern deep learning-based global features.
What are Global Image Descriptors?
Global image descriptors capture characteristics of an entire image in a fixed-size vector representation. Unlike local descriptors that focus on specific keypoints, global descriptors summarize the overall visual content.
Key Properties
- Fixed dimensionality: Consistent vector size regardless of image dimensions (illustrated by the short sketch that follows the imports below)
- Invariance: Robustness, to varying degrees, to transformations such as translation, scale, or illumination changes
- Discriminative power: Ability to distinguish between different image classes
- Computational efficiency: Fast extraction and comparison
- Compact representation: Small memory footprint
import cv2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from skimage import feature, filters, measure
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from typing import Tuple, List, Optional
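Before diving in, here is a minimal sketch of the fixed-dimensionality property: a toy global descriptor (a plain normalized grayscale histogram, used purely for illustration) maps images of different sizes to vectors of the same length.
def tiny_global_descriptor(image, bins=32):
    """A toy global descriptor: a normalized grayscale intensity histogram."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if image.ndim == 3 else image
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).flatten()
    return hist / (hist.sum() + 1e-8)

# Two images of different sizes (synthetic here) yield vectors of identical length
small = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
large = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
print(tiny_global_descriptor(small).shape, tiny_global_descriptor(large).shape)  # (32,) (32,)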
Traditional Global Descriptors
1. Color Histograms
The simplest global descriptor based on color distribution:
class ColorHistogramDescriptor:
"""Extract color histogram features from images"""
def __init__(self, bins=(8, 8, 8), color_space='RGB'):
self.bins = bins
self.color_space = color_space
def extract(self, image):
"""Extract color histogram descriptor"""
# Convert color space if needed
if self.color_space == 'HSV':
image = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
elif self.color_space == 'LAB':
image = cv2.cvtColor(image, cv2.COLOR_RGB2LAB)
# Calculate a joint histogram over the three channels
# (OpenCV hue values only span 0-179, so HSV needs a tighter range on the first channel)
ranges = [0, 180, 0, 256, 0, 256] if self.color_space == 'HSV' else [0, 256, 0, 256, 0, 256]
hist = cv2.calcHist([image], [0, 1, 2], None, self.bins, ranges)
# Normalize histogram
hist = cv2.normalize(hist, hist).flatten()
return hist
def compare(self, hist1, hist2, method='correlation'):
"""Compare two histograms"""
methods = {
'correlation': cv2.HISTCMP_CORREL,
'chi_square': cv2.HISTCMP_CHISQR,
'intersection': cv2.HISTCMP_INTERSECT,
'bhattacharyya': cv2.HISTCMP_BHATTACHARYYA
}
return cv2.compareHist(hist1, hist2, methods[method])
# Usage example
color_desc = ColorHistogramDescriptor(bins=(16, 16, 16), color_space='HSV')
# Extract features from image
image = cv2.imread('sample_image.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
hist_features = color_desc.extract(image)
print(f"Color histogram shape: {hist_features.shape}")
2. Histogram of Oriented Gradients (HOG)
HOG captures shape and structure through gradient orientations:
class HOGDescriptor:
"""Extract HOG (Histogram of Oriented Gradients) features"""
def __init__(self, orientations=9, pixels_per_cell=(8, 8),
cells_per_block=(2, 2), block_norm='L2-Hys'):
self.orientations = orientations
self.pixels_per_cell = pixels_per_cell
self.cells_per_block = cells_per_block
self.block_norm = block_norm
def extract(self, image):
"""Extract HOG descriptor from image"""
# Convert to grayscale if needed
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
# Extract HOG features
features = feature.hog(
gray,
orientations=self.orientations,
pixels_per_cell=self.pixels_per_cell,
cells_per_block=self.cells_per_block,
block_norm=self.block_norm,
visualize=False,
feature_vector=True
)
return features
def extract_with_visualization(self, image):
"""Extract HOG features with visualization"""
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
features, hog_image = feature.hog(
gray,
orientations=self.orientations,
pixels_per_cell=self.pixels_per_cell,
cells_per_block=self.cells_per_block,
block_norm=self.block_norm,
visualize=True,
feature_vector=True
)
return features, hog_image
# Visualize HOG features
def visualize_hog(image_path):
"""Visualize HOG features"""
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
hog_desc = HOGDescriptor()
features, hog_image = hog_desc.extract_with_visualization(image)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.imshow(image)
ax1.set_title('Original Image')
ax1.axis('off')
ax2.imshow(hog_image, cmap='gray')
ax2.set_title('HOG Features')
ax2.axis('off')
plt.tight_layout()
plt.show()
print(f"HOG feature vector length: {len(features)}")
return features
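The HOG vector length depends on the image size and the cell/block layout. As a rough sanity check, the sketch below (expected_hog_length is a helper written for this example, not part of the pipeline above) computes the expected length from the parameters; for the classic 64x128 pedestrian-detection window with the defaults above it works out to 3780.
def expected_hog_length(image_shape, orientations=9,
                        pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Estimate the HOG feature length for a given (height, width) image size."""
    cells_y = image_shape[0] // pixels_per_cell[0]
    cells_x = image_shape[1] // pixels_per_cell[1]
    blocks_y = cells_y - cells_per_block[0] + 1
    blocks_x = cells_x - cells_per_block[1] + 1
    return blocks_y * blocks_x * cells_per_block[0] * cells_per_block[1] * orientations

print(expected_hog_length((128, 64)))  # 3780 for the classic 64x128 window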
3. Local Binary Patterns (LBP)
LBP captures local texture patterns:
class LBPDescriptor:
"""Extract Local Binary Pattern features"""
def __init__(self, radius=3, n_points=None, method='uniform'):
self.radius = radius
self.n_points = n_points if n_points else 8 * radius
self.method = method
def extract(self, image):
"""Extract LBP descriptor"""
# Convert to grayscale
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
# Calculate LBP
lbp = feature.local_binary_pattern(
gray, self.n_points, self.radius, method=self.method
)
# Calculate histogram of LBP values
if self.method == 'uniform':
n_bins = self.n_points + 2
else:
n_bins = 2 ** self.n_points
hist, _ = np.histogram(lbp.ravel(), bins=n_bins,
range=(0, n_bins), density=True)
return hist
def extract_with_visualization(self, image):
"""Extract LBP with visualization"""
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
lbp = feature.local_binary_pattern(
gray, self.n_points, self.radius, method=self.method
)
# Calculate histogram
if self.method == 'uniform':
n_bins = self.n_points + 2
else:
n_bins = 2 ** self.n_points
hist, _ = np.histogram(lbp.ravel(), bins=n_bins,
range=(0, n_bins), density=True)
return hist, lbp
# Visualize LBP
def visualize_lbp(image_path):
"""Visualize LBP features"""
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
lbp_desc = LBPDescriptor(radius=3, method='uniform')
hist, lbp_image = lbp_desc.extract_with_visualization(image)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
ax1.imshow(image)
ax1.set_title('Original Image')
ax1.axis('off')
ax2.imshow(lbp_image, cmap='gray')
ax2.set_title('LBP Image')
ax2.axis('off')
ax3.plot(hist)
ax3.set_title('LBP Histogram')
ax3.set_xlabel('LBP Value')
ax3.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
return hist
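Because the LBP descriptor is a normalized histogram, a chi-square distance is a common way to compare two of them. A minimal sketch using two synthetic textures (chi_square_distance is a small helper added for this example):
def chi_square_distance(hist1, hist2, eps=1e-10):
    """Chi-square distance between two normalized histograms (smaller = more similar)."""
    return 0.5 * np.sum((hist1 - hist2) ** 2 / (hist1 + hist2 + eps))

texture_a = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
texture_b = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
lbp_desc = LBPDescriptor(radius=3, method='uniform')
d = chi_square_distance(lbp_desc.extract(texture_a), lbp_desc.extract(texture_b))
print(f"Chi-square distance: {d:.4f}")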
4. Haralick Texture Features
Statistical texture features from gray-level co-occurrence matrices:
class HaralickDescriptor:
"""Extract Haralick texture features"""
def __init__(self, distances=[1], angles=[0, 45, 90, 135],
levels=256, symmetric=True, normed=True):
self.distances = distances
self.angles = np.radians(angles)
self.levels = levels
self.symmetric = symmetric
self.normed = normed
def extract(self, image):
"""Extract Haralick features"""
# Convert to grayscale and reduce levels
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
# Reduce gray levels for GLCM computation
gray = (gray // (256 // self.levels)).astype(np.uint8)
# Calculate Haralick features
features = []
for distance in self.distances:
for angle in self.angles:
# Calculate GLCM
glcm = feature.graycomatrix(
gray, [distance], [angle],
levels=self.levels,
symmetric=self.symmetric,
normed=self.normed
)
# Extract Haralick properties
contrast = feature.graycoprops(glcm, 'contrast')[0, 0]
dissimilarity = feature.graycoprops(glcm, 'dissimilarity')[0, 0]
homogeneity = feature.graycoprops(glcm, 'homogeneity')[0, 0]
energy = feature.graycoprops(glcm, 'energy')[0, 0]
correlation = feature.graycoprops(glcm, 'correlation')[0, 0]
features.extend([contrast, dissimilarity, homogeneity,
energy, correlation])
return np.array(features)
# Example usage
haralick_desc = HaralickDescriptor(distances=[1, 2], angles=[0, 45, 90, 135])
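With 2 distances, 4 angles, and 5 GLCM properties per combination, the configuration above yields a 2 * 4 * 5 = 40-dimensional vector. A short sketch using a synthetic grayscale patch to confirm:
patch = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
haralick_features = haralick_desc.extract(patch)
print(f"Haralick feature length: {len(haralick_features)}")  # 2 distances * 4 angles * 5 properties = 40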
Modern Deep Learning Descriptors
1. CNN Feature Extraction
Using pre-trained CNN models for global features:
class CNNDescriptor:
"""Extract CNN-based global descriptors"""
def __init__(self, model_name='resnet50', layer='avgpool',
pretrained=True, device='cuda'):
self.device = device
self.model_name = model_name
self.layer = layer
# Load pre-trained model
if model_name == 'resnet50':
self.model = models.resnet50(pretrained=pretrained)
elif model_name == 'vgg16':
self.model = models.vgg16(pretrained=pretrained)
elif model_name == 'densenet121':
self.model = models.densenet121(pretrained=pretrained)
elif model_name == 'efficientnet_b0':
self.model = models.efficientnet_b0(pretrained=pretrained)
else:
raise ValueError(f"Unsupported model: {model_name}")
self.model = self.model.to(device)
self.model.eval()
# Set up feature extraction
self.features = None
self._register_hook()
# Image preprocessing
self.preprocess = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
def _register_hook(self):
"""Register forward hook to extract features"""
def hook(module, input, output):
self.features = output.detach()
# Find the target layer
for name, module in self.model.named_modules():
if name == self.layer:
module.register_forward_hook(hook)
break
def extract(self, image):
"""Extract CNN features from image"""
# Preprocess image
if isinstance(image, np.ndarray):
if len(image.shape) == 3 and image.shape[2] == 3:
# RGB image
input_tensor = self.preprocess(image)
else:
# Grayscale - convert to RGB
image_rgb = np.stack([image] * 3, axis=-1)
input_tensor = self.preprocess(image_rgb)
else:
input_tensor = self.preprocess(image)
input_batch = input_tensor.unsqueeze(0).to(self.device)
# Extract features
with torch.no_grad():
_ = self.model(input_batch)
# Global average pooling if needed
features = self.features.squeeze()
if len(features.shape) > 1:
features = torch.mean(features.view(features.size(0), -1), dim=1)
return features.cpu().numpy()
def extract_batch(self, images):
"""Extract features from batch of images"""
# Preprocess batch
batch_tensors = []
for image in images:
tensor = self.preprocess(image)
batch_tensors.append(tensor)
input_batch = torch.stack(batch_tensors).to(self.device)
# Extract features
with torch.no_grad():
_ = self.model(input_batch)
# Process features
features = self.features
if len(features.shape) > 2:
features = torch.mean(features.view(features.size(0), features.size(1), -1), dim=2)
return features.cpu().numpy()
# Example usage
cnn_desc = CNNDescriptor(model_name='resnet50', layer='avgpool')
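A short usage sketch (device='cpu' here for portability; the images are synthetic placeholders): the ResNet-50 avgpool layer produces a 2048-dimensional vector, which can be compared with cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

cnn_desc = CNNDescriptor(model_name='resnet50', layer='avgpool', device='cpu')
img_a = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
img_b = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
feat_a = cnn_desc.extract(img_a)  # shape: (2048,)
feat_b = cnn_desc.extract(img_b)
print(f"Feature dimension: {feat_a.shape}")
print(f"Cosine similarity: {cosine_similarity([feat_a], [feat_b])[0, 0]:.4f}")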
2. Self-Supervised Global Descriptors
Using self-supervised models for feature extraction:
class SelfSupervisedDescriptor:
"""Extract features using self-supervised models"""
def __init__(self, model_type='simclr', checkpoint_path=None, device='cuda'):
self.device = device
self.model_type = model_type
if model_type == 'simclr':
self.model = self._load_simclr_model(checkpoint_path)
elif model_type == 'swav':
self.model = self._load_swav_model(checkpoint_path)
elif model_type == 'dino':
self.model = self._load_dino_model(checkpoint_path)
else:
raise ValueError(f"Unsupported model type: {model_type}")
self.model = self.model.to(device)
self.model.eval()
# Preprocessing
self.preprocess = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
def _load_simclr_model(self, checkpoint_path):
"""Load SimCLR model"""
# Simplified SimCLR encoder
encoder = models.resnet50(pretrained=False)
encoder.fc = nn.Identity() # Remove classification head
if checkpoint_path:
checkpoint = torch.load(checkpoint_path, map_location='cpu')
encoder.load_state_dict(checkpoint['encoder'])
return encoder
def _load_swav_model(self, checkpoint_path):
"""Load SwAV model"""
# Load SwAV model (simplified)
encoder = models.resnet50(pretrained=False)
encoder.fc = nn.Identity()
if checkpoint_path:
checkpoint = torch.load(checkpoint_path, map_location='cpu')
encoder.load_state_dict(checkpoint['encoder'])
return encoder
def _load_dino_model(self, checkpoint_path):
"""Load DINO model"""
# For this example, use a ViT model
import timm
model = timm.create_model('vit_base_patch16_224', pretrained=True)
model.head = nn.Identity()
return model
def extract(self, image):
"""Extract features from single image"""
input_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
with torch.no_grad():
features = self.model(input_tensor)
return features.squeeze().cpu().numpy()
def extract_batch(self, images):
"""Extract features from batch of images"""
batch_tensors = []
for image in images:
tensor = self.preprocess(image)
batch_tensors.append(tensor)
input_batch = torch.stack(batch_tensors).to(self.device)
with torch.no_grad():
features = self.model(input_batch)
return features.cpu().numpy()
3. Vision Transformer Global Features
Using Vision Transformers for global descriptors:
class ViTDescriptor:
"""Extract global features using Vision Transformers"""
def __init__(self, model_name='vit_base_patch16_224',
pretrained=True, device='cuda'):
self.device = device
import timm
self.model = timm.create_model(model_name, pretrained=pretrained)
# Remove classification head to get features
if hasattr(self.model, 'head'):
self.model.head = nn.Identity()
elif hasattr(self.model, 'classifier'):
self.model.classifier = nn.Identity()
self.model = self.model.to(device)
self.model.eval()
# Get input size
self.input_size = self.model.default_cfg['input_size'][1]
self.preprocess = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(self.input_size),
transforms.CenterCrop(self.input_size),
transforms.ToTensor(),
transforms.Normalize(
mean=self.model.default_cfg['mean'],
std=self.model.default_cfg['std']
)
])
def extract(self, image):
"""Extract ViT features"""
input_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
with torch.no_grad():
features = self.model(input_tensor)
return features.squeeze().cpu().numpy()
def extract_patch_features(self, image):
"""Extract patch-level features (not just CLS token)"""
input_tensor = self.preprocess(image).unsqueeze(0).to(self.device)
with torch.no_grad():
# Get patch embeddings before final pooling
x = self.model.patch_embed(input_tensor)
x = self.model._pos_embed(x)
x = self.model.norm_pre(x)
for block in self.model.blocks:
x = block(x)
x = self.model.norm(x)
# Return both CLS token and patch tokens
cls_token = x[:, 0] # CLS token
patch_tokens = x[:, 1:] # Patch tokens
return cls_token.cpu().numpy(), patch_tokens.cpu().numpy()
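A quick usage sketch (device='cpu' for portability, and assuming a timm version whose ViT exposes the internal _pos_embed/norm_pre attributes used above): for vit_base_patch16_224 the global feature is 768-dimensional, and a 224x224 input yields 14 x 14 = 196 patch tokens.
vit_desc = ViTDescriptor(model_name='vit_base_patch16_224', device='cpu')
img = np.random.randint(0, 256, (300, 300, 3), dtype=np.uint8)
global_feat = vit_desc.extract(img)
cls_token, patch_tokens = vit_desc.extract_patch_features(img)
print(f"Global feature: {global_feat.shape}")   # (768,)
print(f"CLS token: {cls_token.shape}")          # (1, 768)
print(f"Patch tokens: {patch_tokens.shape}")    # (1, 196, 768)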
Descriptor Comparison and Evaluation
Performance Evaluation Framework
class DescriptorEvaluator:
"""Evaluate and compare different image descriptors"""
def __init__(self, descriptors, dataset_path):
self.descriptors = descriptors
self.dataset_path = dataset_path
self.features_cache = {}
def extract_features(self, descriptor_name, image_paths):
"""Extract features for all images using specified descriptor"""
if descriptor_name in self.features_cache:
return self.features_cache[descriptor_name]
descriptor = self.descriptors[descriptor_name]
features = []
print(f"Extracting {descriptor_name} features...")
for i, img_path in enumerate(image_paths):
if i % 100 == 0:
print(f"Processing {i}/{len(image_paths)}")
image = cv2.imread(img_path)
if image is None:
continue
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
try:
feature = descriptor.extract(image)
features.append(feature)
except Exception as e:
print(f"Error processing {img_path}: {e}")
continue
features = np.array(features)
self.features_cache[descriptor_name] = features
return features
def evaluate_retrieval(self, descriptor_name, query_features,
gallery_features, query_labels, gallery_labels, k=10):
"""Evaluate image retrieval performance"""
from sklearn.metrics.pairwise import cosine_similarity
# Calculate similarities
similarities = cosine_similarity(query_features, gallery_features)
# Calculate metrics
precisions = []
recalls = []
for i, query_label in enumerate(query_labels):
# Get top-k similar images
sim_scores = similarities[i]
top_k_indices = np.argsort(sim_scores)[::-1][:k]
# Calculate precision and recall
retrieved_labels = gallery_labels[top_k_indices]
relevant_retrieved = np.sum(retrieved_labels == query_label)
total_relevant = np.sum(gallery_labels == query_label)
precision = relevant_retrieved / k if k > 0 else 0
recall = relevant_retrieved / total_relevant if total_relevant > 0 else 0
precisions.append(precision)
recalls.append(recall)
return {
'precision@k': np.mean(precisions),
'recall@k': np.mean(recalls),
'descriptor': descriptor_name
}
def evaluate_classification(self, descriptor_name, features, labels, test_size=0.3):
"""Evaluate classification performance"""
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
features, labels, test_size=test_size, random_state=42, stratify=labels
)
# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train classifier
classifier = SVC(kernel='rbf', random_state=42)
classifier.fit(X_train_scaled, y_train)
# Predict
y_pred = classifier.predict(X_test_scaled)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
return {
'accuracy': accuracy,
'descriptor': descriptor_name,
'report': classification_report(y_test, y_pred, output_dict=True)
}
def compare_descriptors(self, image_paths, labels, evaluation_type='classification'):
"""Compare all descriptors"""
results = {}
for desc_name in self.descriptors.keys():
print(f"\nEvaluating {desc_name}...")
# Extract features
features = self.extract_features(desc_name, image_paths)
if evaluation_type == 'classification':
result = self.evaluate_classification(desc_name, features, labels)
elif evaluation_type == 'retrieval':
# For retrieval, split into query and gallery
split_idx = len(features) // 2
query_features = features[:split_idx]
gallery_features = features[split_idx:]
query_labels = labels[:split_idx]
gallery_labels = labels[split_idx:]
result = self.evaluate_retrieval(
desc_name, query_features, gallery_features,
query_labels, gallery_labels
)
results[desc_name] = result
return results
def visualize_results(self, results, metric='accuracy'):
"""Visualize comparison results"""
descriptors = list(results.keys())
scores = [results[desc][metric] for desc in descriptors]
plt.figure(figsize=(12, 6))
bars = plt.bar(descriptors, scores)
plt.title(f'Descriptor Comparison - {metric.title()}')
plt.ylabel(metric.title())
plt.xlabel('Descriptor')
plt.xticks(rotation=45)
# Add value labels on bars
for bar, score in zip(bars, scores):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{score:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
return plt.gcf()
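A sketch of wiring several of the descriptors above into the evaluator; the dictionary keys, dataset path, image_paths, and labels are hypothetical placeholders for your own data. For hand-crafted descriptors such as HOG, resize images to a common size beforehand so every feature vector has the same length.
descriptors = {
    'color_hist': ColorHistogramDescriptor(bins=(8, 8, 8), color_space='HSV'),
    'hog': HOGDescriptor(),
    'lbp': LBPDescriptor(radius=3, method='uniform'),
    'resnet50': CNNDescriptor(model_name='resnet50', layer='avgpool', device='cpu'),
}
evaluator = DescriptorEvaluator(descriptors, dataset_path='path/to/dataset')
# With image_paths (list of file paths) and labels (np.array of class labels) prepared:
# results = evaluator.compare_descriptors(image_paths, labels, evaluation_type='classification')
# evaluator.visualize_results(results, metric='accuracy')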
Advanced Descriptor Techniques
1. Bag of Visual Words (BoVW)
class BagOfVisualWords:
"""Bag of Visual Words descriptor using local features"""
def __init__(self, vocab_size=500, detector_type='sift'):
self.vocab_size = vocab_size
self.detector_type = detector_type
self.vocabulary = None
self.kmeans = None
# Initialize feature detector
if detector_type == 'sift':
self.detector = cv2.SIFT_create()
elif detector_type == 'orb':
self.detector = cv2.ORB_create()
else:
raise ValueError(f"Unsupported detector: {detector_type}")
def build_vocabulary(self, image_paths, max_images=1000):
"""Build visual vocabulary from training images"""
print("Building visual vocabulary...")
all_descriptors = []
# Extract descriptors from training images
for i, img_path in enumerate(image_paths[:max_images]):
if i % 100 == 0:
print(f"Processing {i}/{min(len(image_paths), max_images)}")
image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
if image is None:
continue
# Detect keypoints and extract descriptors
keypoints, descriptors = self.detector.detectAndCompute(image, None)
if descriptors is not None:
all_descriptors.append(descriptors)
# Combine all descriptors
all_descriptors = np.vstack(all_descriptors)
# Cluster descriptors to create vocabulary
print(f"Clustering {len(all_descriptors)} descriptors into {self.vocab_size} words...")
self.kmeans = KMeans(n_clusters=self.vocab_size, random_state=42, n_init=10)
self.kmeans.fit(all_descriptors)
self.vocabulary = self.kmeans.cluster_centers_
print("Vocabulary built successfully!")
def extract(self, image):
"""Extract BoVW descriptor from image"""
if self.vocabulary is None:
raise ValueError("Vocabulary not built. Call build_vocabulary() first.")
# Convert to grayscale if needed
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
# Detect keypoints and extract descriptors
keypoints, descriptors = self.detector.detectAndCompute(gray, None)
if descriptors is None:
return np.zeros(self.vocab_size)
# Assign descriptors to vocabulary words
words = self.kmeans.predict(descriptors)
# Create histogram of word occurrences
hist, _ = np.histogram(words, bins=self.vocab_size, range=(0, self.vocab_size))
# Normalize histogram
hist = hist.astype(np.float32)
if np.sum(hist) > 0:
hist = hist / np.sum(hist)
return hist
# Usage example
bovw = BagOfVisualWords(vocab_size=500, detector_type='sift')
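The vocabulary must be built before extraction. A sketch of the full workflow (the file paths are hypothetical placeholders); the resulting descriptor is a 500-bin normalized word histogram.
training_paths = ['path/to/train/img_001.jpg', 'path/to/train/img_002.jpg']  # hypothetical training images
bovw.build_vocabulary(training_paths, max_images=1000)

image = cv2.imread('sample_image.jpg')  # hypothetical query image
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
bovw_hist = bovw.extract(image)
print(f"BoVW descriptor shape: {bovw_hist.shape}")  # (500,)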
2. Fisher Vector Encoding
class FisherVectorDescriptor:
"""Fisher Vector encoding of local features"""
def __init__(self, n_components=64, detector_type='sift'):
self.n_components = n_components
self.detector_type = detector_type
self.gmm = None
# Initialize feature detector
if detector_type == 'sift':
self.detector = cv2.SIFT_create()
elif detector_type == 'surf':
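# Note: SURF is patented and only available in opencv-contrib builds compiled with the nonfree modules enabled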
self.detector = cv2.xfeatures2d.SURF_create()
else:
raise ValueError(f"Unsupported detector: {detector_type}")
def fit_gmm(self, image_paths, max_images=1000):
"""Fit Gaussian Mixture Model on training descriptors"""
from sklearn.mixture import GaussianMixture
print("Fitting GMM for Fisher Vector...")
all_descriptors = []
# Extract descriptors from training images
for i, img_path in enumerate(image_paths[:max_images]):
if i % 100 == 0:
print(f"Processing {i}/{min(len(image_paths), max_images)}")
image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
if image is None:
continue
keypoints, descriptors = self.detector.detectAndCompute(image, None)
if descriptors is not None:
all_descriptors.append(descriptors)
# Combine all descriptors
all_descriptors = np.vstack(all_descriptors)
# Fit GMM
print(f"Fitting GMM with {self.n_components} components...")
self.gmm = GaussianMixture(n_components=self.n_components, random_state=42)
self.gmm.fit(all_descriptors)
print("GMM fitted successfully!")
def extract(self, image):
"""Extract Fisher Vector from image"""
if self.gmm is None:
raise ValueError("GMM not fitted. Call fit_gmm() first.")
# Convert to grayscale if needed
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
# Extract local descriptors
keypoints, descriptors = self.detector.detectAndCompute(gray, None)
if descriptors is None:
# No keypoints detected: return a zero vector with the Fisher Vector dimensionality
return np.zeros(2 * self.n_components * self.gmm.means_.shape[1])
# Compute Fisher Vector
fisher_vector = self._compute_fisher_vector(descriptors)
return fisher_vector
def _compute_fisher_vector(self, descriptors):
"""Compute Fisher Vector encoding"""
# Get GMM parameters
means = self.gmm.means_
covariances = self.gmm.covariances_
weights = self.gmm.weights_
# Compute soft assignments
posteriors = self.gmm.predict_proba(descriptors)
# Initialize Fisher Vector
d = descriptors.shape[1]
fv = np.zeros(2 * self.n_components * d)
# Compute Fisher Vector components
for k in range(self.n_components):
# Deviation from mean
diff = descriptors - means[k]
# Weighted deviations
weighted_diff = posteriors[:, k:k+1] * diff
# First order statistics (gradient w.r.t. mean)
first_order = np.sum(weighted_diff, axis=0) / np.sqrt(weights[k])
# Second order statistics (gradient w.r.t. variance)
second_order = np.sum(posteriors[:, k:k+1] *
(diff**2 / covariances[k] - 1), axis=0) / np.sqrt(2 * weights[k])
# Store in Fisher Vector
fv[k*d:(k+1)*d] = first_order
fv[(self.n_components + k)*d:(self.n_components + k + 1)*d] = second_order
# Power normalization
fv = np.sign(fv) * np.sqrt(np.abs(fv))
# L2 normalization
fv = fv / np.linalg.norm(fv)
return fv
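The Fisher Vector dimensionality is 2 * K * d (first- and second-order statistics per Gaussian), where K is the number of GMM components and d is the local descriptor dimension. With K = 64 components and 128-dimensional SIFT descriptors that is 16,384 dimensions, which is why the power and L2 normalization steps matter so much in practice. A quick sanity check:
n_components, sift_dim = 64, 128
fv_length = 2 * n_components * sift_dim
print(f"Fisher Vector length: {fv_length}")  # 16384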
3. VLAD (Vector of Locally Aggregated Descriptors)
class VLADDescriptor:
"""VLAD (Vector of Locally Aggregated Descriptors) encoding"""
def __init__(self, n_clusters=64, detector_type='sift'):
self.n_clusters = n_clusters
self.detector_type = detector_type
self.kmeans = None
# Initialize feature detector
if detector_type == 'sift':
self.detector = cv2.SIFT_create()
elif detector_type == 'orb':
self.detector = cv2.ORB_create()
else:
raise ValueError(f"Unsupported detector: {detector_type}")
def build_codebook(self, image_paths, max_images=1000):
"""Build visual codebook for VLAD"""
print("Building VLAD codebook...")
all_descriptors = []
# Extract descriptors from training images
for i, img_path in enumerate(image_paths[:max_images]):
if i % 100 == 0:
print(f"Processing {i}/{min(len(image_paths), max_images)}")
image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
if image is None:
continue
keypoints, descriptors = self.detector.detectAndCompute(image, None)
if descriptors is not None:
all_descriptors.append(descriptors)
# Combine all descriptors
all_descriptors = np.vstack(all_descriptors)
# Build codebook using K-means
print(f"Clustering into {self.n_clusters} visual words...")
self.kmeans = KMeans(n_clusters=self.n_clusters, random_state=42, n_init=10)
self.kmeans.fit(all_descriptors)
print("Codebook built successfully!")
def extract(self, image):
"""Extract VLAD descriptor from image"""
if self.kmeans is None:
raise ValueError("Codebook not built. Call build_codebook() first.")
# Convert to grayscale if needed
if len(image.shape) == 3:
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
else:
gray = image
# Extract local descriptors
keypoints, descriptors = self.detector.detectAndCompute(gray, None)
if descriptors is None:
# No keypoints detected: return a zero vector with the VLAD dimensionality
return np.zeros(self.n_clusters * self.kmeans.cluster_centers_.shape[1])
# Compute VLAD encoding
vlad_vector = self._compute_vlad(descriptors)
return vlad_vector
def _compute_vlad(self, descriptors):
"""Compute VLAD encoding"""
# Assign descriptors to nearest cluster centers
cluster_assignments = self.kmeans.predict(descriptors)
# Get cluster centers
centers = self.kmeans.cluster_centers_
# Initialize VLAD vector
vlad = np.zeros((self.n_clusters, descriptors.shape[1]))
# Accumulate residuals for each cluster
for i in range(self.n_clusters):
# Find descriptors assigned to cluster i
cluster_mask = cluster_assignments == i
if np.any(cluster_mask):
# Compute residuals (descriptor - cluster center)
residuals = descriptors[cluster_mask] - centers[i]
# Sum residuals
vlad[i] = np.sum(residuals, axis=0)
# Flatten VLAD vector
vlad = vlad.flatten()
# Power normalization
vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
# L2 normalization
vlad = vlad / np.linalg.norm(vlad)
return vlad
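VLAD is more compact than the Fisher Vector: its length is K * d (one accumulated residual per cluster), e.g. 64 * 128 = 8,192 dimensions with SIFT. A brief usage sketch mirroring the BoVW workflow (training_paths and image are hypothetical placeholders):
vlad_desc = VLADDescriptor(n_clusters=64, detector_type='sift')
# vlad_desc.build_codebook(training_paths)   # training_paths: hypothetical list of image files
# vlad_vector = vlad_desc.extract(image)     # image: RGB numpy array
# print(vlad_vector.shape)                   # (64 * 128,) = (8192,)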
Practical Applications
Image Retrieval System
class ImageRetrievalSystem:
"""Complete image retrieval system using global descriptors"""
def __init__(self, descriptor, similarity_metric='cosine'):
self.descriptor = descriptor
self.similarity_metric = similarity_metric
self.database_features = None
self.database_paths = None
def build_database(self, image_paths):
"""Build feature database"""
print("Building image database...")
features = []
valid_paths = []
for i, img_path in enumerate(image_paths):
if i % 100 == 0:
print(f"Processing {i}/{len(image_paths)}")
try:
image = cv2.imread(img_path)
if image is None:
continue
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
feature = self.descriptor.extract(image)
features.append(feature)
valid_paths.append(img_path)
except Exception as e:
print(f"Error processing {img_path}: {e}")
continue
self.database_features = np.array(features)
self.database_paths = valid_paths
print(f"Database built with {len(valid_paths)} images")
def search(self, query_image_path, top_k=10):
"""Search for similar images"""
if self.database_features is None:
raise ValueError("Database not built. Call build_database() first.")
# Extract query features
query_image = cv2.imread(query_image_path)
query_image = cv2.cvtColor(query_image, cv2.COLOR_BGR2RGB)
query_features = self.descriptor.extract(query_image)
# Compute similarities
if self.similarity_metric == 'cosine':
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_features], self.database_features)[0]
elif self.similarity_metric == 'euclidean':
from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances([query_features], self.database_features)[0]
similarities = 1 / (1 + distances) # Convert to similarity
# Get top-k results
top_indices = np.argsort(similarities)[::-1][:top_k]
results = []
for idx in top_indices:
results.append({
'path': self.database_paths[idx],
'similarity': similarities[idx]
})
return results
def visualize_results(self, query_path, results, max_display=5):
"""Visualize search results"""
fig, axes = plt.subplots(1, min(len(results) + 1, max_display + 1),
figsize=(20, 4))
# Display query image
query_img = cv2.imread(query_path)
query_img = cv2.cvtColor(query_img, cv2.COLOR_BGR2RGB)
axes[0].imshow(query_img)
axes[0].set_title('Query')
axes[0].axis('off')
# Display top results
for i, result in enumerate(results[:max_display]):
if i + 1 >= len(axes):
break
result_img = cv2.imread(result['path'])
result_img = cv2.cvtColor(result_img, cv2.COLOR_BGR2RGB)
axes[i + 1].imshow(result_img)
axes[i + 1].set_title(f'Rank {i+1}\nSim: {result["similarity"]:.3f}')
axes[i + 1].axis('off')
plt.tight_layout()
plt.show()
# Example usage
def example_retrieval_system():
"""Example of using the image retrieval system"""
# Initialize descriptor
descriptor = CNNDescriptor(model_name='resnet50', layer='avgpool')
# Create retrieval system
retrieval_system = ImageRetrievalSystem(descriptor, similarity_metric='cosine')
# Build database (replace the glob pattern with your own image directory)
import glob
database_paths = sorted(glob.glob('path/to/database/images/*.jpg'))
retrieval_system.build_database(database_paths)
# Search for similar images
query_path = 'path/to/query/image.jpg' # Replace with actual path
results = retrieval_system.search(query_path, top_k=10)
# Visualize results
retrieval_system.visualize_results(query_path, results)
return results
Conclusion
Global image descriptors remain fundamental to computer vision, evolving from hand-crafted features to sophisticated deep learning representations. Key insights:
Traditional vs. Modern Descriptors
Traditional strengths:
- Interpretability: Clear understanding of what features represent
- Efficiency: Fast extraction and low memory requirements
- Robustness: Well-understood invariance properties
- Specialization: Tailored for specific visual properties
Deep learning advantages:
- Representation power: Learn complex, hierarchical features
- Transfer learning: Pre-trained models work across domains
- End-to-end optimization: Features optimized for specific tasks
- Scalability: Handle large-scale datasets effectively
Best Practices
- Task-specific selection: Choose descriptors based on application requirements
- Preprocessing importance: Proper image normalization and resizing
- Feature normalization: L2 normalization often improves performance (a short sketch follows this list)
- Dimensionality considerations: Balance between discriminative power and efficiency
- Evaluation methodology: Use appropriate metrics for your specific task
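For example, a minimal sketch of L2-normalizing a batch of descriptors before computing cosine or dot-product similarities (l2_normalize is a small helper written for this example):
def l2_normalize(features, eps=1e-10):
    """L2-normalize each row of a (n_samples, n_dims) feature matrix."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / (norms + eps)

features = np.random.rand(10, 2048)       # e.g. a batch of CNN descriptors
features = l2_normalize(features)
print(np.linalg.norm(features, axis=1))   # all ~1.0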
Future Directions
- Self-supervised learning: More robust features without labeled data
- Multi-modal descriptors: Combining visual with textual information
- Efficient architectures: Lightweight models for mobile applications
- Domain adaptation: Descriptors that transfer across different domains
The choice of global descriptor depends on your specific requirements: computational constraints, accuracy needs, interpretability requirements, and available training data.
References
- Dalal, N., & Triggs, B. (2005). "Histograms of oriented gradients for human detection."
- Ojala, T., et al. (2002). "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns."
- Perronnin, F., & Dance, C. (2007). "Fisher kernels on visual vocabularies for image categorization."
- Jégou, H., et al. (2010). "Aggregating local descriptors into a compact image representation."
- Simonyan, K., & Zisserman, A. (2014). "Very deep convolutional networks for large-scale image recognition."