Fine-Tuning A Face Detection Network in PyTorch

What am I doing today?

I have installed PyTorch on my system and run the S3FD Face Detection code in PyTorch at SFD PyTorch. It works very well to detect faces at different scales. The aim of my experiment is to convert this face detection network into a face recognition or gender recognition network. Towards this end, we will look at different approaches.

Let’s get started with understanding the network structure of S3FD.

The Face Detection Network S3FD currently has this architecture:

S3FD Network

There are 4 different sections of layers in the above network. It is necessary to understand these well to make changes to the architecture. The network architecture is based on VGG16. There are Base Convolutional Layers, Extra Convolutional Layers where the FC layers of VGG are replaced by Conv layers.

The Detection Layers are particularly important. They select conv3 3, conv4 3, conv5 3, conv fc7, conv6 2 and conv7 2 as the detection layers, which are associated with different scales of anchor to predict detections. The output feature maps from these Detection Layers are fed to the Predicted Convolutional Layers.

The Predicted Convolutional Layers predicts 4 offsets relative to its coordinates and 2 scores for classification (Face / Not Face).

For more details, see the paper: S3FD ArXiv Paper

To see where this architecture is defined in code, go to:

Getting back to the s3fd architecture, once you’ve defined your layers – you need to tell the network what a forward pass through those layers will look like i.e. what is the input to each layer and what is the output and to which layer is this output fed as input and so on.

You started out with a 3 channel image, you require a 64 channel tensor. So you need 64 3 x 3 x 3 kernels altogether. So the kernel size is 64 x 3 x 3 x 3 (N x C x H x W). If this is not clear to you, you should re-visit the figure 1.3 in the above pdf file: A guide to Convolutional Arithmethics

This output h is passed again though Conv and Relu followed by a MaxPool operation. A MaxPool operation takes in the the input tensor N x C x H x W and reduces its size to N x C x H/2 x W/2 if the stride is 2 for both H and W dimensions. For odd dimensions, use integer multiplication.

Experiment 1

In our first experiment, we shall forget about the Predicted Convolution Layers, the Normalization Layer and the Detection Layers. We shall use our toy dataset to train the network to learn Facial Recognition for 10 classes (i.e. 10 people). To achieve this, we shall add a Fully Connected layer after the Conv7_2 layer. This FC Layer will take the output of Conv7_2 layer as input and give an output score for each one of the classes.

Let’s get into the code, shall we?

Let us change our network architecture file to reflect our modified network:

Here is the original file again:

And this is what our modified class s3fd looks like:

class s3fd(nn.Module):
    def __init__(self, num_classes):
        super(s3fd, self).__init__()
        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)

        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)

        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)

        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)

        self.fc6     = nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=3)
        self.fc7     = nn.Conv2d(1024, 1024, kernel_size=1, stride=1, padding=0)

        self.conv6_1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
        self.conv6_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)

        self.conv7_1 = nn.Conv2d(512, 128, kernel_size=1, stride=1, padding=0)
        self.conv7_2 = nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1)

        self.fc_1 = nn.Linear(2304,num_classes)

Notice that we added a new Fully Connected layer which takes a 2304 dimensional input vector and outputs a 10 x 1 vector corresponding to our 10 classes. Also notice that we added the num_classes variable as input to the init function.

The Forward function also needs to be changed now. We need to flatten the N x 256 x H x W dimensional output from conv7_2 to N x 1 x 2304 for the FC layer input.

    def forward(self, x):
        h = F.relu(self.conv1_1(x))
        h = F.relu(self.conv1_2(h))
        h = F.max_pool2d(h, 2, 2)

        h = F.relu(self.conv2_1(h))
        h = F.relu(self.conv2_2(h))
        h = F.max_pool2d(h, 2, 2)

        h = F.relu(self.conv3_1(h))
        h = F.relu(self.conv3_2(h))
        h = F.relu(self.conv3_3(h)); f3_3 = h
        h = F.max_pool2d(h, 2, 2)

        h = F.relu(self.conv4_1(h))
        h = F.relu(self.conv4_2(h))
        h = F.relu(self.conv4_3(h)); f4_3 = h

        h = F.max_pool2d(h, 2, 2)

        h = F.relu(self.conv5_1(h))
        h = F.relu(self.conv5_2(h))
        h = F.relu(self.conv5_3(h)); f5_3 = h
        h = F.max_pool2d(h, 2, 2)

        h = F.relu(self.fc6(h))
        h = F.relu(self.fc7(h));     ffc7 = h
        h = F.relu(self.conv6_1(h))
        h = F.relu(self.conv6_2(h)); f6_2 = h
        h = F.relu(self.conv7_1(h))
        h = F.relu(self.conv7_2(h)); f7_2 = h
        #m = F.max_pool2d(h, 2, 2)
        m = f7_2.view(1,-1)
        op = self.fc_1(m)
        return F.softmax(op)

The f7_2.view(1, -1) function will take the input tensor and say - give me an output with only 1 row (Height) and the rest of the numbers as columns (Width). Then we will send this as input to our newly created FC layer. Finally, we will run a softmax function over to it to ensure the outputs are normalized over the layer and lie between 0-1 summing over to 1.

Great! we are done with network architecture changes. Let’s head over to our training code now.

Fire up your ipython notebook or your python file and let’s do a bunch of imports.

from __future__ import print_function

import torch
import torch.optim as optim
from import Dataset

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from import DataLoader
import matplotlib.pyplot as plt
from torch.autograd import Variable
torch.backends.cudnn.bencmark = True
import torchvision
import torchvision.transforms as transforms

import os,sys,cv2,random,datetime,time,math
import pandas as pd
import argparse
import numpy as np
from net_s3fd import *
from bbox import *
from sklearn.preprocessing import MultiLabelBinarizer
from PIL import Image

Now, since we have our custom toy dataset we need to create a class to help the torch dataloader to grab images from the custom dataset.

You can download the Toy dataset from NYU Classes (Resources Section) and extract it to a data/ directory in the current directory.

There is a train folder and an index.csv which has the names of the training files mapped under ‘image_names’ and the class it corresponds to under ‘tags’. Instead of turning this into a face recognition system, you can turn this into a gender recognition system by using the column ‘Gender’ to train and setting num_classes variable defined below to 2 instead of 10.

Note that our dataset is very small and only has 20 images per celebrity. This means that the network will not be able to generalize to a wide extent. Since we have very less data, we shall also try to keep the number of parameters to be optimized to very less. At the moment we have 2304*10 weights in the FC layer which need to be learnt (and biases).

class CelebDataset(Dataset):
    """Dataset wrapping images and target labels
        A CSV file path
        Path to image folder
        Extension of images
        PIL transforms

    def __init__(self, csv_path, img_path, img_ext, transform=None):
        tmp_df = pd.read_csv(csv_path)
        assert tmp_df['Image_Name'].apply(lambda x: os.path.isfile(img_path + x + img_ext)).all(), \
"Some images referenced in the CSV file were not found"
        = MultiLabelBinarizer()
        self.img_path = img_path
        self.img_ext = img_ext
        self.transform = transform

        self.X_train = tmp_df['Image_Name']
        self.y_train =['Gender'].str.split()).astype(np.float32)

    def __getitem__(self, index):
        img = cv2.imread(self.img_path + self.X_train[index] + self.img_ext)
        img = cv2.resize(img, (256,256))
        img = img - np.array([104,117,123])
        img = img.transpose(2, 0, 1)
        #img = img.reshape((1,)+img.shape)
        img = torch.from_numpy(img).float()
        #img = Variable(torch.from_numpy(img).float(),volatile=True)
        #if self.transform is not None:
        #    img = self.transform(img)
        label = torch.from_numpy(self.y_train[index])
        return img, label

    def __len__(self):
        return len(self.X_train.index)

We inherit Torch’s Dataset module and instantiate it. We read the index.csv file using Pandas and instantiate the MultiLabel Binarizer from SciKit Learn. This creates a binary array for multi-label classificaion where each sample is one row of a 2d array of shape (n_samples, n_classes) with binary values: the one, i.e. the non zero elements, corresponds to the subset of labels. An array such as np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) represents label 0 in the first sample, labels 1 and 2 in the second sample, and no labels in the third sample.

For example, for the male and female classes - if an image is a female, this will be represented as [1.0, 0.0] while if an image is a male, this will be represented as [0.0, 1.0], if an image is a hermaphrodite, it is represented as [1.0, 1.0].

We store the training data in X_train and the binarized labels in y_train.

The getitem function takes an index as input and returns the particular image and label associated with that index. There is a bunch of pre-processing we need to do on the image to make it into a nice normalized form that can be fed to the network. We resize the image to 256 x 256. The face detection network works even if the images are of different sizes. However, if the image is of any arbitrary size, our FC layer will not get a fixed sized input (which it does require). We therefore choose to resize the image.

We can alternatively do the above pre-processing with the transform function by PyTorch and uncomment the lines:

        #if self.transform is not None:
        #    img = self.transform(img)

and write your transform function as:

#transformations = transforms.Compose(
#    [
#     transforms.ToTensor(),
#     .. # bunch of other pre-processing functions
#     transforms.Normalize(mean=[104,117,123])
#     ])

Whatever suits you works here.

Now that we have defined the dataset, we are ready to load this dataset into PyTorch’s Data Loader.

train_data = "index.csv"
img_path = "data/Celeb_Small_Dataset/"
img_ext = ".jpg"
dset = CelebDataset(train_data,img_path,img_ext,transformations)
train_loader = DataLoader(dset,
                          num_workers=1 # 1 for CUDA
                         # pin_memory=True # CUDA only

Great! We have defined the data loader. Next, let’s make sure that once we do start the training, we are able to save the trained models at different checkpoints whenever we wish to. (Imagine running the training for 5 hours and then losing the trained model because you forgot to uncomment a line after – that happens!)

def save(model, optimizer, loss, filename):
    save_dict = {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        }, filename)

Great! Now we are ready to define our train function and get to the meat of the code.

def train_model(model, criterion, optimizer, num_classes, num_epochs = 100):

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        running_loss = 0.0

        for i,(img,label) in enumerate(train_loader):
            img = img.view((1,)+img.shape[1:])
            if use_cuda:
                data, target = Variable(img.cuda()), Variable(torch.Tensor(label).cuda())
                data, target = Variable(img), Variable(torch.Tensor(label))
            target = target.view(num_classes,1)

            # Make sure to start with zero gradients otherwise, 
            # gradients from the previous run will affect your current run.


            # forward pass - all the magic happens here.

            outputs = model(data)

            # our loss function calculates the loss between the outputs and 
            # targets vectors
            loss = criterion(outputs, target)

            # let's print some output so we know the network is doing something

            if i%50==0:
                print("Reached iteration ",i)
                running_loss +=[0]

            # backward pass - the magic actually happens here.
            # and the optimizer step - where the parameters are updated based on 
            # the gradients calculated in the backward pass.

            # let's keep a track of the loss through an epoch.

            running_loss +=[0]
        if epoch % 10 == 0:
            save(model, optimizer, loss, 'faceRecog.saved.model')

Great! So that’s how training will work. Now let’s load our pre-trained S3fd model and switch it’s parameters.

# define the number of classes in your problem
# For gender recognition, I have 2 classes.

num_classes = 2

# Initialize your model's network architecture by calling S3fd

myModel = s3fd(num_classes)

# Load the pre-trained model.

loadedModel = torch.load('s3fd_convert.pth')

# Our new model has different layers from the old model.

newModel = myModel.state_dict()

# We only want to use the layers from the pre-trained model that are defined in
# our new model

pretrained_dict = {k: v for k, v in loadedModel.items() if k in newModel}

# Let's update our model from weights extracted from the pre-trained model


# Load the state and we are good to go.


Now, if you are using GPU, you want to turn the use_cuda = True. It is also a good idea to check that your network is correctly loaded by printing the myModel.eval()

use_cuda = True

Great! We are almost done. Now let’s call the methods above and put everything together.

# Define your loss function to what you want.
# Play around with this.

criterion = nn.MSELoss()

# Turn the requires_grad off for all the parameters in your model.
# Remember you don't want to train and change the weights of any of the layers
# except the final FC layer.

for param in myModel.parameters():
    param.requires_grad = False

# This layer was already there but we can re-instantiate it.
# fc_1 layer by default will now contain requires_grad = True.
# This will be the only layer that actually learns weights from the data.

myModel.fc_1 = nn.Linear(2304,num_classes)

# Define the optimization method, learning rate, momentum.
# You can use ADAM etc.
# Play around with this.
# Note that we need to send it the parameters to be optimized. 
# We filter and only send it those parameters with requires_grad = True. 
# (i.e. the FC Layer params)

optimizer = optim.SGD(filter(lambda p: p.requires_grad,myModel.parameters()), lr=0.0001, momentum=0.9)

if use_cuda:
    myModel = myModel.cuda()

# Call the training function defined above.
model_ft = train_model(myModel, criterion, optimizer, num_classes, num_epochs=100)

And, we are done!

the output looks something like this:

Epoch 0/99
Reached iteration  0
Reached iteration  50
Reached iteration  100
Reached iteration  150
Epoch 1/99
Reached iteration  0
Reached iteration  50
Reached iteration  100
Reached iteration  150
Epoch 2/99
Reached iteration  0
Reached iteration  50
Reached iteration  100
Reached iteration  150


The loss values over the entire training set are printed.

My test data was not in a well formatted manner like the train data. So I simply tested my outputs on a few images and saw if I got the correct one:

def transform(img_path):
        img = cv2.imread(img_path)
        img = cv2.resize(img, (256,256))
        img = img - np.array([104,117,123])
        img = img.transpose(2, 0, 1)
        img = img.reshape((1,)+img.shape)
        img = torch.from_numpy(img).float()
        return Variable(img.cuda())
myModel = myModel.cuda()
testImage1 = transform('data/Test/TestCeleb_4/25-FaceId-0.jpg')
testImage2 = transform('data/Test/TestCeleb_4/26-FaceId-0.jpg')
testImage3 = transform('data/Test/TestCeleb_4/27-FaceId-0.jpg')
testImage4 = transform('data/Test/TestCeleb_10/25-FaceId-0.jpg')
testImage5 = transform('data/Test/TestCeleb_10/26-FaceId-0.jpg')
testImage6 = transform('data/Test/TestCeleb_10/24-FaceId-0.jpg')

output1 = myModel(testImage1)
output2 = myModel(testImage2)
output3 = myModel(testImage2)
output4 = myModel(testImage4)
output5 = myModel(testImage5)
output6 = myModel(testImage6)
print("testImage1 - ",output1)
print("testImage2 - ",output2)
print("testImage3 - ",output3)
print("testImage1 - ",output4)
print("testImage2 - ",output5)
print("testImage3 - ",output6)

Where do we go from here?

  1. Play with the parameters, see what works better. The dataset is a small subset of the CelebA dataset - CelebA Dataset You can try increasing the size of the data and see if that helps!

  2. Change the network architecture. Use the features from the Predicted Convolutional Layers to do the classification. Instead of FC layers, you can add Convolution Layers, Pooling layers. But remember, too many layers means too many weights. Make sure your dataset is big enough to learn from.

  3. The network detects faces at different scales. Can you make assumptions about which scale wil give the best performance? This should be possible since the CelebA dataset has cropped pictures of only the face and we are resizing it to a fixed size. If you can make this assumption and find which of the different Detection Layers is responsible for the detection, you can use it’s feature vector instead of using conv7_2 as we did in this tutorial.

  4. A more difficult approach - Copy the feature maps that yielded “Face” in the detection and train classification layers over it.

  5. Find the co-ordinates of the face detected, feed the cropped face back to the network, and extract the feature vector from one of the Detection Layers. Then classify it.

  6. Cats