ArcFace, also known as Additive Angular Margin Loss, is a deep learning loss function originally designed for face recognition tasks. It has been widely adopted and proven to be highly effective in enhancing the discriminative power of face features. In this blog post, we'll explore the concept behind the ArcFace loss function and its usage in deep learning.

What is ArcFace Loss Function?

The ArcFace loss function was introduced by Deng et al. in their paper "ArcFace: Additive Angular Margin Loss for Deep Face Recognition" (2018). The main idea is to add an angular margin to the angle between a feature vector and the weight vector of its ground truth class before the logits are passed through a softmax layer. The margin makes the target class harder to satisfy during training, which increases the inter-class variance and decreases the intra-class variance. The result is better feature separability and, in turn, improved face recognition accuracy.

ArcFace loss function can be defined as:

$Loss = -\log \dfrac{\exp(S \cos(\theta + m))}{\exp(S \cos(\theta + m)) + \sum_{j} \exp(S \cos(\theta_j))}$

where S is the scaling factor, m is the angular margin, $\theta$ is the angle between the feature and the weight vector of the ground truth class, and $\theta_j$ is the angle between the feature and the weight vector of each other class j, so the sum runs over all classes except the ground truth.
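To make the formula concrete, here is a minimal sketch in plain Python that evaluates it for a single sample. The function name arcface_single and the example values (S = 30, m = 0.5 and the three angles) are assumptions chosen for illustration, not values taken from the paper.

```python
import math

def arcface_single(theta, other_thetas, S=30.0, m=0.5):
    """Evaluate the ArcFace loss for one sample from its angles (in radians).

    theta        -- angle to the ground truth class weight vector
    other_thetas -- angles to the weight vectors of the other classes
    S, m         -- scale and angular margin (illustrative defaults)
    """
    target_logit = S * math.cos(theta + m)
    other_logits = [S * math.cos(t) for t in other_thetas]
    all_logits = [target_logit] + other_logits
    # Subtract the largest logit before exponentiating for numerical stability.
    biggest = max(all_logits)
    denom = sum(math.exp(l - biggest) for l in all_logits)
    return -(target_logit - biggest - math.log(denom))

# Hypothetical sample: 0.6 rad from its own class, 1.2 and 1.5 rad from two others.
loss_no_margin = arcface_single(0.6, [1.2, 1.5], m=0.0)
loss_margin = arcface_single(0.6, [1.2, 1.5], m=0.5)
print(loss_no_margin, loss_margin)
```

The sample that is classified almost perfectly without a margin incurs a noticeably larger loss once the margin is added, which is what pushes the network to shrink the angle to the correct class even further.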

How to calculate ArcFace

The ArcFace loss function can be used for a variety of classification tasks where the goal is to identify the class of an image.

Computing the ArcFace loss involves the following steps:

Normalize the embeddings (normally the output of a Linear layer) and the class weights
Calculate the dot product (i.e. the cosine similarity)
Calculate the angles with arccos
Add a constant margin m to the angle corresponding to the ground truth label
Convert angles back to cosines
Use cross entropy on the adjusted cosine values to calculate loss
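The six steps above can be traced by hand on a single sample. The sketch below uses only the Python standard library; the helper names (normalize, arcface_logits, cross_entropy) and the 2D embedding and class weights are hypothetical and exist only for this walk-through.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def arcface_logits(embedding, class_weights, target, m=0.4):
    """Return the margin-adjusted cosines for one embedding (steps 1 to 5)."""
    # Step 1: normalize the embedding and every class weight vector.
    e = normalize(embedding)
    ws = [normalize(w) for w in class_weights]
    # Step 2: dot products of unit vectors are cosine similarities.
    cosines = [sum(a * b for a, b in zip(e, w)) for w in ws]
    # Step 3: convert to angles, clamping to guard against rounding error.
    angles = [math.acos(max(-1.0, min(1.0, c))) for c in cosines]
    # Step 4: add the margin to the ground truth class only.
    angles[target] += m
    # Step 5: convert the angles back to cosines.
    return [math.cos(a) for a in angles]

def cross_entropy(logits, target):
    """Step 6: softmax cross entropy for a single sample."""
    log_sum = math.log(sum(math.exp(l) for l in logits))
    return log_sum - logits[target]

# Hypothetical 2D embedding and three class weight vectors.
embedding = [2.0, 1.0]
class_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
adjusted = arcface_logits(embedding, class_weights, target=0)
print(adjusted, cross_entropy(adjusted, target=0))
```

Only the entry for the ground truth class changes: its cosine drops because the angle grew by m, while the cosines of the other classes are left untouched.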

The loss compares the margin-adjusted predictions with the ground truth class, and its gradient is used to update the model weights during training.

Its usage in Deep Learning

Aside from face recognition, ArcFace has been applied to various other computer vision tasks, including object classification, person re-identification, and landmark recognition. In these tasks it helps by making the embeddings learned by the network more discriminative.

Implementation in PyTorch

In the next section we will implement the ArcFace loss function in PyTorch and compare it against standard cross entropy.

Import the relevant packages

from tqdm import tqdm

import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader

To start off we need to set some basic variables (you can modify these). For this tutorial we will be using the MNIST dataset, which is used to build models that predict handwritten digits from images. Since we know the number of classes, we can set output_classes to 10. We set emb_size to 3 so that we can visualise the embeddings in 3D later.

emb_size = 3
output_classes = 10

We now need to create the base model, a basic Convolutional Neural Network (CNN) that accepts a final_layer argument. This lets us swap the output layer later to test the different loss functions.

# a very basic network: two 3x3 convolutions with ReLU, max pooling, dropout and two linear layers
class SimpleCNN(nn.Module):
    def __init__(self, final_layer):
        super(SimpleCNN, self).__init__()
        self.base_layers = nn.Sequential(
            nn.Conv2d(1, 32, 3, 1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(9216, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, emb_size)
        )
        self.final_layer = final_layer

    def get_base_layers(self, x):
        return self.base_layers(x)

    def forward(self, x):
        x = self.get_base_layers(x)
        
        output = self.final_layer(x)
        return output
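The input size of 9216 in the first Linear layer follows from the shape of the feature map after the convolutions and pooling. As a quick check, the arithmetic can be sketched as follows; conv_out is a hypothetical helper written for this example, not a PyTorch function.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28                      # MNIST images are 28 x 28
side = conv_out(side, 3)       # first 3x3 convolution, stride 1
side = conv_out(side, 3)       # second 3x3 convolution, stride 1
side = conv_out(side, 2, 2)    # 2x2 max pooling with stride 2
flat_features = 64 * side * side   # 64 channels after the second convolution
print(side, flat_features)
```

If you change the kernel sizes or add layers, rerunning this calculation gives the value that nn.Linear needs as its first argument.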

As mentioned previously, the ArcFace loss operates on the cosine similarity between an embedding and the weight vector of each class. We can create a custom CosineComponent that computes these cosines and use it as the final layer of our base model later.

class CosineComponent(nn.Module):
    
    def __init__(self, emb_size, output_classes):
        super().__init__()
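        # one weight vector (column) per class, filled in by the Kaiming initialisation
        self.W = nn.Parameter(torch.empty(emb_size, output_classes))
        nn.init.kaiming_uniform_(self.W)

    def forward(self, x):
        # Step 1: L2-normalize each embedding (rows) and each class weight (columns)
        x_norm = F.normalize(x, dim=1)
        W_norm = F.normalize(self.W, dim=0)
        # Step 2: dot products of unit vectors are the cosine similarities
        return x_norm @ W_norm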

The main focus of this post is the ArcFace loss itself. The next function takes the cosine values and adjusts them using a margin of 0.4 (again, you can change this). It converts the cosines into angles, adds the margin to the angle of the ground truth class, converts the angles back to cosines, and passes the result to the standard F.cross_entropy loss function. Note that this simplified version does not multiply the cosines by the scaling factor S from the formula, which is equivalent to setting S = 1.

def arcface_loss(cosine, target, m=.4):
    # this prevents nan when a value slightly crosses 1.0 due to numerical error
    cosine = cosine.clip(-1+1e-7, 1-1e-7) 
    # Step 3:
    arcosine = cosine.arccos()
    # Step 4:
    arcosine += F.one_hot(target, num_classes = output_classes) * m
    # Step 5:
    cosine2 = arcosine.cos()
    # Step 6:
    return F.cross_entropy(cosine2, target)
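The arcface_loss function above feeds the adjusted cosines straight into cross entropy, which corresponds to S = 1 in the formula. A variant with an explicit scale could look like the sketch below. The name arcface_loss_scaled and the default s=30.0 are assumptions for illustration (the ArcFace paper reports using a scale of 64 with a margin of 0.5), and the number of classes is read from the input rather than from the global output_classes.

```python
import torch
import torch.nn.functional as F

def arcface_loss_scaled(cosine, target, m=0.4, s=30.0):
    """ArcFace loss with an explicit scale factor (a sketch, not the post's function).

    cosine -- (batch, classes) cosine similarities
    target -- (batch,) integer class labels
    """
    # Keep values strictly inside (-1, 1) so arccos and its gradient stay finite.
    cosine = cosine.clamp(-1 + 1e-7, 1 - 1e-7)
    theta = cosine.arccos()
    # Add the margin only at the ground truth class of each row.
    margin = F.one_hot(target, num_classes=cosine.shape[1]) * m
    logits = s * torch.cos(theta + margin)
    return F.cross_entropy(logits, target)

# Hypothetical batch of two samples and three classes.
cosine = torch.tensor([[0.8, 0.1, -0.3], [0.2, 0.7, 0.0]])
target = torch.tensor([0, 1])
print(arcface_loss_scaled(cosine, target))
```

Because cosines are bounded by 1, unscaled logits can never produce a very confident softmax; the scale widens the range of the logits so that the loss can approach zero for well separated classes. With m=0.0 and s=1.0 the function reduces to plain cross entropy on the cosines.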

Now that we have the model parts created, we need to set up the training and testing functions. The training function takes the loss as an argument, which allows us to interchange the loss_func.

def train(model, device, train_loader, optimizer, loss_func, epoch, log_interval=10, dry_run=False):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        
        loss = loss_func(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if dry_run:
                break

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # the model outputs logits (or cosines), not log-probabilities, so use cross entropy
            test_loss += F.cross_entropy(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the highest score
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

We also need functions to extract the embeddings from the model and to display them in a visualisation.

# helper method to extract all embeddings from a data loader and project them onto the unit sphere
def get_embs(model, dl, device):
    embs = []
    ys = []
    for bx,by in tqdm(dl):
        bx, by = bx.to(device), by.to(device)
        with torch.no_grad():
            embs.append(model.get_base_layers(bx))
            ys.append(by)
    embs = torch.cat(embs)
    embs = embs / embs.norm(p=2,dim=1)[:,None]
    ys = torch.cat(ys)
    return embs, ys

# helper to plot embeddings in 3D
def plot_embs(embs, ys, ax):
    #ax.axis('off')
    for k in range(10):
        e = embs[ys==k].cpu()
        ax.scatter(e[:,0], e[:,1], e[:,2], s=4, alpha=.2)

It's time to load the MNIST data and create the train_loader and test_loader.


train_kwargs = {'batch_size': 256, 'shuffle': True}  # reshuffle the training data every epoch
test_kwargs = {'batch_size': 256}

transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
    ])
    
dataset1 = datasets.MNIST('../data', train=True, download=True,
                    transform=transform)
dataset2 = datasets.MNIST('../data', train=False,
                    transform=transform)
train_loader = DataLoader(dataset1,**train_kwargs)
test_loader = DataLoader(dataset2, **test_kwargs)

We can then create the model with the CosineComponent as its final layer.

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

model = SimpleCNN(CosineComponent(emb_size, output_classes))
model.to(device)

optimizer = optim.Adadelta(model.parameters(), lr=1.0)

scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
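With step_size=1 and gamma=0.7, StepLR multiplies the learning rate by 0.7 after every epoch. The short sketch below reproduces that decay in plain Python so you can see how quickly it shrinks over ten epochs; lr_at_epoch is a hypothetical helper, not part of PyTorch.

```python
base_lr, gamma, step_size = 1.0, 0.7, 1

def lr_at_epoch(epoch_index):
    """Learning rate in effect after `epoch_index` calls to scheduler.step()."""
    return base_lr * gamma ** (epoch_index // step_size)

# Learning rate used in each of the ten training epochs.
schedule = [round(lr_at_epoch(k), 4) for k in range(10)]
print(schedule)
```

By the last epoch the learning rate is roughly 4% of its starting value, so most of the movement in the embeddings happens in the first few epochs.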

When we train the model we will pass in the arcface_loss function.

for epoch in range(1, 10 + 1):
    train(model, device, train_loader, optimizer, arcface_loss, epoch)
    test(model, device, test_loader)
    scheduler.step()

Finally, we can extract the embeddings based on the trained model.

embs_arcface, ys_arcface  = get_embs(model.eval(), test_loader, device)

Now we repeat the same process, except that instead of the cosine component we pass a basic nn.Linear as the final layer of the model. We also use the standard nn.CrossEntropyLoss function instead of the ArcFace loss. Note that nn.CrossEntropyLoss applies the softmax internally, so the model outputs raw logits.

model_conv = SimpleCNN(nn.Linear(emb_size, output_classes))
model_conv.to(device)

optimizer_conv = optim.Adadelta(model_conv.parameters(), lr=1.0)
loss = nn.CrossEntropyLoss()

scheduler_conv = StepLR(optimizer_conv, step_size=1, gamma=0.7)

for epoch in range(1, 10 + 1):
    train(model_conv, device, train_loader, optimizer_conv, loss, epoch)
    test(model_conv, device, test_loader)
    scheduler_conv.step()

Again we can extract the embeddings from the traditional model.

embs_softmax, ys_softmax  = get_embs(model_conv, test_loader, device)

With the embeddings from both models, we can compare them side by side in a 3D plot. In the left plot, which uses the embeddings from the model trained with the ArcFace loss, the examples of each class are clustered more tightly than in the right plot, which uses the model trained with the standard cross entropy loss.

_,(ax1,ax2)=plt.subplots(1,2, figsize=(20,10), subplot_kw={'projection':'3d'})
plot_embs(embs_arcface, ys_arcface, ax1)
plot_embs(embs_softmax, ys_softmax, ax2)

plt.show()

This matches the explanation in the ArcFace paper: the margin pushes the decision boundaries between the classes further apart than Softmax Cross Entropy does.
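The visual impression can also be backed by a number. One possible measure, sketched below, is the average cosine similarity between each embedding and the centroid of its class, where values closer to 1 mean tighter clusters. The function mean_cosine_to_centroid is a hypothetical helper and the data in the example is synthetic, not output from the models above.

```python
import torch
import torch.nn.functional as F

def mean_cosine_to_centroid(embs, ys):
    """Average cosine similarity between each embedding and its class centroid."""
    embs = F.normalize(embs, dim=1)
    sims = []
    for label in ys.unique():
        members = embs[ys == label]
        # The centroid is re-normalized so the dot product is a cosine.
        centroid = F.normalize(members.mean(dim=0), dim=0)
        sims.append(members @ centroid)
    return torch.cat(sims).mean().item()

# Synthetic check: a tight cluster should score higher than a loose one.
torch.manual_seed(0)
ys = torch.zeros(100, dtype=torch.long)
tight = torch.tensor([1.0, 0.0, 0.0]) + 0.05 * torch.randn(100, 3)
loose = torch.tensor([1.0, 0.0, 0.0]) + 0.50 * torch.randn(100, 3)
tight_score = mean_cosine_to_centroid(tight, ys)
loose_score = mean_cosine_to_centroid(loose, ys)
print(tight_score, loose_score)
```

Calling mean_cosine_to_centroid(embs_arcface, ys_arcface) and mean_cosine_to_centroid(embs_softmax, ys_softmax) would then give one compactness score per model to compare.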

Conclusion

The ArcFace loss function is a powerful and widely used loss function for face recognition tasks. By adding an angular margin to the angle of the ground truth class before the logits are passed through the softmax layer, it improves the discriminative power of the learned features.

References

ArcFace: Additive Angular Margin Loss for Deep Face Recognition - https://arxiv.org/abs/1801.07698
Example arcface code - https://www.kaggle.com/code/slawekbiel/arcface-explained