YOLO with PyTorch (build from scratch)

GitHub

You Only Look Once (YOLO) is a pioneering object detection algorithm that revolutionized the field of computer vision. Developed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, YOLOv1 introduced a novel approach to real-time object detection by dividing an image into a grid and predicting bounding boxes and class probabilities for each grid cell. This single-pass architecture significantly improved detection speed without compromising accuracy, making it a go-to choice for various applications, including autonomous vehicles, surveillance systems, and image recognition. The simplicity and efficiency of YOLOv1 marked a milestone in the evolution of object detection algorithms, setting the stage for further advancements in the field.

I know that this is a relatively old model and there are a lot of implementations available, but it holds a significant place in object detection research, and I thought implementing it from scratch would be a good review of the basics.

If you are interested, read the paper!


How it works


Loading data from Pascal VOC dataset

The Pascal VOC dataset is a standard dataset for object detection. The main versions are 2007 and 2012; we are going to use the 2012 version.

Classes included in this dataset are: (“aeroplane”, “bicycle”, “bird”, “boat”, “bottle”, “bus”, “car”, “cat”, “chair”, “cow”, “diningtable”, “dog”, “horse”, “motorbike”, “person”, “pottedplant”, “sheep”, “sofa”, “train”, “tvmonitor”).

To download it you can use PyTorch’s VOCDetection helper class:

import torch
from torchvision.datasets.voc import VOCDetection
from torchvision.transforms import v2 as T  # the v2 transforms provide ToDtype

dataset = VOCDetection(
    root=config.DATA_PATH,  # where to store / load the data
    year="2012",
    image_set=("train" if type == "train" else "val"),
    download=download,
    transform=T.Compose(
        [
            T.PILToTensor(),                       # PIL image -> uint8 tensor
            T.Resize(config.IMAGE_SIZE),           # resize to the model input size
            T.ToDtype(torch.float32, scale=True),  # uint8 -> float32 in [0, 1]
        ]
    ),
)

As you can see, you have to specify the path to store the data (root), the year, the train or val split, whether to download or load from disk, and the transforms. This is the initial transform that converts the loaded image to a tensor and resizes it to the model input shape. Changing the datatype from uint8 to float32 is necessary since we are going to normalize the image later.

Each image has an XML label that specifies the class, the bounding box coordinates, and the image size. It contains some other data as well, but we don't need it.

[Figure: a sample Pascal VOC XML annotation]

As you can see, it only provides us with the two corner points of the bounding box, but the YOLOv1 output should be in the format of center coordinates, width, and height. So we need to convert these values, and to avoid instability they are normalized to [0, 1]:

$$
x_c = \frac{x_{min} + x_{max}}{2W}, \qquad
y_c = \frac{y_{min} + y_{max}}{2H}, \qquad
w = \frac{x_{max} - x_{min}}{W}, \qquad
h = \frac{y_{max} - y_{min}}{H}
$$

where W and H are the image width and height.
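A minimal sketch of that conversion, assuming the center is normalized by the full image width and height (the paper itself parametrizes x and y relative to the grid cell, which would be a small extra step; this helper is illustrative and not the repo's dataset code):

def corners_to_yolo(x_min, y_min, x_max, y_max, W, H):
    """Convert corner coordinates to a normalized (x_c, y_c, w, h) box in [0, 1]."""
    x_c = (x_min + x_max) / 2 / W
    y_c = (y_min + y_max) / 2 / H
    w = (x_max - x_min) / W
    h = (y_max - y_min) / H
    return x_c, y_c, w, h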

Data augmentation

To increase model robustness and avoid overfitting, data augmentation is used. According to the paper, they applied random scaling and translations, and randomly adjusted the exposure and saturation of the image in the HSV color space. To achieve this we use the following transformations from PyTorch:

import random
import torchvision.transforms.functional as TF

# Random shift of up to +-10% of the image size and scale of up to +20%
x_shift = int((0.2 * random.random() - 0.1) * config.IMAGE_SIZE[0])
y_shift = int((0.2 * random.random() - 0.1) * config.IMAGE_SIZE[1])
scale = 1 + 0.2 * random.random()

# Augment images
if self.augment:
    data = TF.affine(data, angle=0.0, scale=scale, translate=(x_shift, y_shift), shear=0.0)
    data = TF.adjust_hue(data, 0.2 * random.random() - 0.1)
    data = TF.adjust_saturation(data, 0.2 * random.random() + 0.9)

These are the original augmentation techniques used in the paper. You can add other transformations, but be careful to recalculate the bbox locations based on the transformations used. Here we have to recalculate for the scale and the shifts, because only these transformations affect the bbox locations (see the sketch below).
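For example, since TF.affine here scales about the image center and then shifts by (x_shift, y_shift) pixels, a normalized (x_c, y_c, w, h) box could be updated along these lines (a sketch under that assumption; the names are illustrative, not the repo's):

def augment_bbox(x_c, y_c, w, h, scale, x_shift, y_shift, W, H):
    """Mirror the geometric part of the augmentation on a normalized box:
    scale about the image center, then translate by the pixel offsets."""
    x_c = 0.5 + scale * (x_c - 0.5) + x_shift / W
    y_c = 0.5 + scale * (y_c - 0.5) + y_shift / H
    w = w * scale
    h = h * scale
    # Boxes pushed (partially) outside the image should also be clipped or dropped
    return x_c, y_c, w, h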

Normalize

Normalizing the input image (or any other type of input) usually has benefits such as faster and more stable convergence:

data = TF.normalize(data, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

The reason everyone uses these particular values for the mean and std is interesting: they are a remnant of the ImageNet dataset and its competitions, and they have become a de facto standard in many vision applications.

Confidence

Since each bbox is annotated by a human and has a class, we are sure that an object is present at the provided coordinates. Therefore we set the confidence of each ground-truth object, and the probability of its corresponding class, to 1.0.
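To make this concrete, here is a rough sketch of how one image's ground-truth boxes could be encoded into the S x S x (C + B*5) target grid with confidences and class probabilities set to 1.0. The layout ([class probs | conf, x, y, w, h per box]) matches the indexing used in the loss code later in this post, but the function itself is an illustration, not the repo's dataset code:

import torch

def encode_target(boxes, classes):
    """boxes: list of normalized (x_c, y_c, w, h); classes: list of class indices.
    Returns an (S, S, C + B*5) target tensor; the same box is written into all B slots."""
    target = torch.zeros(config.S, config.S, config.C + config.B * 5)
    for (x_c, y_c, w, h), cls in zip(boxes, classes):
        # Grid cell that contains the box center
        col = min(int(x_c * config.S), config.S - 1)
        row = min(int(y_c * config.S), config.S - 1)
        target[row, col, cls] = 1.0  # class probability
        for b in range(config.B):
            base = config.C + b * 5
            target[row, col, base] = 1.0  # confidence
            target[row, col, base + 1 : base + 5] = torch.tensor([x_c, y_c, w, h])
    return target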


YOLO Architecture

The original architecture in the paper uses a custom 24-convolutional-layer backbone inspired by GoogLeNet (the Darknet-19 backbone only appears later, in YOLOv2).

[Figure: the YOLOv1 network architecture from the paper]

As you can see, it takes a 3x448x448 RGB image. The output is 7x7x30 (refer to the How it works section). We have to calculate the padding of each layer ourselves, since paddings are not included in the architecture description. To do that we have to invert the following formula, which gives the output size of convolution and MaxPool layers (ref):

$$
H_{out} = \left\lfloor \frac{H_{in} + 2 \times padding - dilation \times (kernel\_size - 1) - 1}{stride} + 1 \right\rfloor
$$

where dilation is 1 (dilation makes the convolution kernel skip pixels of its input; it is not important for YOLOv1). By solving this formula for the padding we can figure out the whole structure (a small helper for this is sketched below).

I have separated the backbone and the Head of the model, and I connect them together to make the whole model. The reason is that I don't have enough resources to train the backbone from scratch on the ImageNet dataset and then fine-tune it on Pascal VOC. In the paper, they originally trained it for one week on ImageNet 2012 to achieve 88% (top-5) accuracy.
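A minimal sketch of solving that formula for the padding (assuming dilation = 1 and the odd kernel sizes used here, and rounding up, since the floor in the formula absorbs the extra half pixel):

import math

def conv_padding(in_size, out_size, kernel_size, stride, dilation=1):
    """Invert the Conv2d/MaxPool output-size formula to get the padding
    that maps in_size -> out_size. A sketch, not code from the repo."""
    pad = ((out_size - 1) * stride - in_size + dilation * (kernel_size - 1) + 1) / 2
    return math.ceil(pad)

# e.g. the first layer of YOLOv1, a 7x7 conv with stride 2: 448 -> 224 needs padding 3
print(conv_padding(448, 224, kernel_size=7, stride=2))  # 3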

To train the model faster, I'm going to use a pretrained ResNet50 (trained on ImageNet), modify only the output layers, and train just those layers (not the backbone). So the Head module of the model is going to be the same for both variants:

class YOLOv1(nn.Module):
    def __init__(self, pretrained=True):
        super().__init__()

        if pretrained:
            self.model = nn.Sequential(Resnet_Backbone(), Head_Module())
        else:
            self.model = nn.Sequential(YOLO_from_scratch_backbone(), Head_Module())

    def forward(self, x):
        return self.model.forward(x)

The Head module has two fully connected layers, with a dropout after the first one. We also need to reshape the output of the last layer to the output shape we want (7x7x30).

class Head_Module(nn.Module):
    def __init__(self):
        super().__init__()
        self.depth = config.B * 5 + config.C

        layers = []
        layers += [
            nn.Flatten(),
            # FC 1
            nn.Linear(config.S * config.S * 1024, 4096),
            nn.Dropout(p=0.5),
            nn.LeakyReLU(negative_slope=0.1),
            # FC 2
            nn.Linear(4096, config.S * config.S * self.depth),
            nn.Sigmoid(),
        ]

        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return torch.reshape(self.model.forward(x), (x.size(dim=0), config.S, config.S, self.depth))

Loss function

To calculate the gradients we need a single value that measures the overall error of the model. In the paper, the loss function is:

$$
\begin{aligned}
\lambda_{coord} &\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+ \lambda_{coord} &\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+ &\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
+ &\sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

where 1_i^obj denotes whether an object appears in cell i, and 1_ij^obj denotes that the jth bounding box predictor in cell i is responsible for that prediction. Responsible means that the predicted bbox has the largest overlap (IOU) with the ground truth. So for each grid cell we calculate the Intersection over Union (IOU) of every predicted bbox with the ground-truth (label) bbox and treat the one with the maximum IOU as responsible.

By looking at the output of the model, we can see that most parts of the loss function are intuitive.

As stated in the paper:

We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λ_coord and λ_noobj, to accomplish this. We set λ_coord = 5 and λ_noobj = .5.

IOU (intersection over union)

It is a metric used in object detection to evaluate the accuracy of predicted bounding boxes compared to ground-truth bounding boxes (it essentially measures the overlap).

Formula

IOU = (Area of Intersection) / (Area of Union)

Process to calculate it:

  1. Get two bboxes, one predicted and one ground truth.
  2. Calculate the area of intersection: use the min and max of the X and Y coordinates of the boxes to get the overlap along X and Y, then multiply to get the area, where X = x2-x1 and Y = y2-y1. If either X or Y is smaller than 0 (negative), there is no intersection, so we clamp it to zero.
# Calculate intersection points
x1 = torch.max(box1_x1, box2_x1)
x2 = torch.min(box1_x2, box2_x2)
y1 = torch.max(box1_y1, box2_y1)
y2 = torch.min(box1_y2, box2_y2)

# Calculate the intersection area
# Clamp by 0 when there is no intersection
intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
  3. Calculate the union area: union = area_bbox_predicted + area_bbox_ground_truth - intersection_area.

We add an epsilon (a small value) to avoid division by zero in the next step, in case the union is zero.

  4. Calculate the IOU: intersection / union
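Putting the steps together, a minimal version of the IOU computation for corner-format boxes could look like this (a sketch of the idea; the repo's get_iou also handles center-format boxes, which is why it is called with a "corner" argument in the NMS code below):

import torch

def get_iou_corners(box1, box2, eps=1e-6):
    """IOU for boxes given as (..., 4) tensors in corner format [x1, y1, x2, y2]."""
    # Intersection rectangle (clamped to 0 when the boxes don't overlap)
    x1 = torch.max(box1[..., 0], box2[..., 0])
    y1 = torch.max(box1[..., 1], box2[..., 1])
    x2 = torch.min(box1[..., 2], box2[..., 2])
    y2 = torch.min(box1[..., 3], box2[..., 3])
    intersection = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)

    # Union = area1 + area2 - intersection, with epsilon against division by zero
    area1 = (box1[..., 2] - box1[..., 0]) * (box1[..., 3] - box1[..., 1])
    area2 = (box2[..., 2] - box2[..., 0]) * (box2[..., 3] - box2[..., 1])
    union = area1 + area2 - intersection

    return intersection / (union + eps)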

Non max suppression

It is a technique to keep only the most relevant bounding boxes for each object.

Process

  1. Sort bboxes based on confidence scores, with the highest score first.
  2. Select next bbox in the sorted list
  3. Calculate the IOU of the selected bbox with all other bboxes
  4. Suppress overlapping boxes, remove bboxes that have high IOU (over threshold) with the selected bounding box. This ensures that redundant detections or highly overlapping but lower confidence bboxes are removed.
  5. Repeat steps 2 to 4 until all of the boxes are processed

Note: In the following code I use a set() to keep track of suppressed bboxes. A bbox is added to this set if it is not the currently selected bbox, the classes are the same, and the IOU is above the threshold.

def nms(bboxes, iou_threshold):
    """
    Perform non-maximum suppression on a list of bounding boxes.

    Parameters:
    - bboxes (List[List[float]]): List of bounding boxes.
    - iou_threshold (float): IoU threshold for suppression.

    Returns:
    - List[List[float]]: Suppressed bounding boxes.
    """
    result = []
    num_boxes = len(bboxes)

    # Sort based on confidence
    bboxes.sort(key=lambda x: x[4], reverse=True)

    bboxes = torch.Tensor(bboxes).unsqueeze(1)
    suppressed = set()
    # Overlap Check
    for i in range(num_boxes):
        if i in suppressed:
            continue

        curbox = bboxes[i]
        ious = get_iou(curbox[..., 0:4], bboxes[..., 0:4], "corner")

        # curbox = [x1, y1, x2, y2, confidence, class_index]
        for j in range(i + 1, num_boxes):
            if i != j and curbox[0, -1] == bboxes[j, 0, -1] and ious[j].item() > iou_threshold:
                suppressed.add(j)

    for i in range(num_boxes):
        if i not in suppressed:
            result.append(bboxes[i][0])
    return result

There is another way to implement NMS: use argsort to sort the indices (not the bboxes) and append a bbox to the result on every loop iteration. That way you have to modify and delete the suppressed bbox indices. In the end they produce the same result; maybe I will implement the second way later! A rough sketch of that variant follows.
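Here is a sketch of that argsort variant, reusing the get_iou_corners helper sketched in the IOU section (an illustration of the idea, not the repo's code):

import torch

def nms_argsort(bboxes, iou_threshold):
    """Alternative NMS: sort the indices by confidence, keep the best remaining
    box on each iteration, and drop same-class boxes that overlap it too much."""
    boxes = torch.Tensor(bboxes)  # each row: [x1, y1, x2, y2, confidence, class_index]
    order = torch.argsort(boxes[:, 4], descending=True).tolist()

    result = []
    while order:
        i = order.pop(0)  # the highest-confidence remaining box is always kept
        result.append(boxes[i])
        # Keep only boxes of a different class or with low overlap with the kept box
        order = [
            j for j in order
            if boxes[i, -1].item() != boxes[j, -1].item()
            or get_iou_corners(boxes[i, 0:4], boxes[j, 0:4]).item() <= iou_threshold
        ]
    return result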

Loss Code

# Calculate the iou's for each bbox in each cell - (batch, S, S)
iou_b1 = get_iou(preds[..., 21:25], targets[..., 21:25])
iou_b2 = get_iou(preds[..., 26:30], targets[..., 26:30])

_, bestbox_indices = torch.max(torch.stack([iou_b1, iou_b2]), dim=0)  # (batch, S, S)

bestbox_indices is a matrix that tells us, for each grid cell, which prediction is responsible for the detection. In the case of B=2, 0 means the first and 1 means the second prediction (refer to How it works to understand B).

We are going to use this matrix to compute obj_ij, and later the loss, by multiplying it with other matrices. Since any value other than 1 (in the case of B=2, that means 0) would zero out the loss for that prediction, we need a mask whose value is 1 for the responsible predictor of each cell that contains an object. To differentiate between the predictors, we stack two masks together: for the first predictor we convert the 0s to 1s (1 - bestbox_indices), and for the second we use bestbox_indices itself.

responsible = torch.stack([1 - bestbox_indices, bestbox_indices], dim=-1)
obj_i = targets[..., 20].unsqueeze(3) > 0.0 
# exists * responsible
obj_ij = obj_i * responsible
noobj_ij = 1 - obj_ij
# XY losses
x_loss = self.mse_loss(obj_ij * bbox_attr(preds, 1), obj_ij * bbox_attr(targets, 1))
y_loss = self.mse_loss(obj_ij * bbox_attr(preds, 2), obj_ij * bbox_attr(targets, 2))
xy_loss = x_loss + y_loss
# WH losses
p_width = bbox_attr(preds, 3)  # (batch, S, S, B)
t_width = bbox_attr(targets, 3)  # (batch, S, S, B)

# Since the predicted width can be negative we use abs. We add epsilon so the sqrt
# gradient does not blow up at zero, and multiply by sign to preserve the gradient direction.
width_loss = self.mse_loss(
    obj_ij * torch.sign(p_width) * torch.sqrt(torch.abs(p_width) + config.EPSILON),
    obj_ij * torch.sqrt(t_width),
)
p_height = bbox_attr(preds, 4)  # (batch, S, S, B)
t_height = bbox_attr(targets, 4)  # (batch, S, S, B)
height_loss = self.mse_loss(
    obj_ij * torch.sign(p_height) * torch.sqrt(torch.abs(p_height) + config.EPSILON),
    obj_ij * torch.sqrt(t_height),
)
wh_loss = width_loss + height_loss

Note that we do not use the sign and abs functions on the target w and h, since we are sure that these values are positive.

# Confidence losses
obj_confidence_loss = self.mse_loss(obj_ij * bbox_attr(preds, 0), obj_ij * bbox_attr(targets, 0))
noobj_confidence_loss = self.mse_loss(noobj_ij * bbox_attr(preds, 0), noobj_ij * bbox_attr(targets, 0))
# Class losses
class_loss = self.mse_loss(obj_i * preds[..., : config.C], obj_i * targets[..., : config.C])
total_loss = (
    self.lambda_coord * (xy_loss + wh_loss)
    + obj_confidence_loss
    + self.lambda_noobj * noobj_confidence_loss
    + class_loss
)

Train

Just a simple PyTorch training loop.

for data, labels, _ in tqdm_dataloader:
    data = data.to(device)
    labels = labels.to(device)

    optimizer.zero_grad()
    outputs = model(data)
    loss = loss_fn(outputs, labels)
    loss.backward()

    optimizer.step()