The goal is to recognize objects from a number of visual object classes in realistic scenes. It is fundamentally a supervised learning problem in that a training set of labelled images is provided. The twenty object classes that have been selected are:
Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
Dataset used: PASCAL VOC 2012
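As a minimal sketch, the twenty classes listed above can be turned into integer labels for training. The alphabetical ordering and the helper names (`VOC_GROUPS`, `CLASS_TO_INDEX`, `one_hot`) are assumptions made for illustration; the dataset's annotation files may spell some class names differently (for example without spaces), so the strings may need adjusting.

```python
# The twenty PASCAL VOC classes, grouped as in the list above.
VOC_GROUPS = {
    "Person": ["person"],
    "Animal": ["bird", "cat", "cow", "dog", "horse", "sheep"],
    "Vehicle": ["aeroplane", "bicycle", "boat", "bus", "car", "motorbike", "train"],
    "Indoor": ["bottle", "chair", "dining table", "potted plant", "sofa", "tv/monitor"],
}

# Flat class list; sorting alphabetically is an assumption, any fixed order works.
VOC_CLASSES = sorted(name for names in VOC_GROUPS.values() for name in names)

# Map each class name to an integer label.
CLASS_TO_INDEX = {name: idx for idx, name in enumerate(VOC_CLASSES)}


def one_hot(class_name):
    """Return a one-hot vector (list of floats) for the given class name."""
    vec = [0.0] * len(VOC_CLASSES)
    vec[CLASS_TO_INDEX[class_name]] = 1.0
    return vec
```

For example, `one_hot("dog")` returns a list of 20 values with a single 1.0 at the position `CLASS_TO_INDEX["dog"]`.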
Object detection is the task of automatically recognizing and efficiently locating multiple objects in an image, using a model trained with supervision on a sufficiently large training set, while avoiding false predictions and multiple bounding boxes for the same object.
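Duplicate bounding boxes for the same object are commonly removed with non-maximum suppression (NMS), which relies on the intersection over union (IoU) between boxes. The following is a minimal pure-Python sketch, not the implementation used in any particular detector; the box format `(x_min, y_min, x_max, y_max)`, the function names, and the default threshold of 0.5 are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)  # the second box overlaps the first and is dropped
```

In practice NMS is usually applied per class, so that overlapping objects of different classes are not suppressed against each other.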
Object detectors are commonly divided into two types:
● Single-stage object detectors
● Two-stage object detectors
Most two-stage object detectors share the following architecture:
● The first stage generates regions of interest, for example using a Region Proposal Network (RPN).
● The second stage classifies each proposed region and refines its bounding box.
Some two-stage object detection algorithms are:
● R-CNN
● Fast R-CNN
● Faster R-CNN
● Feature Pyramid Network
Single-stage object detectors use a single deep network to predict the bounding boxes as well as the object confidence scores. The image is passed through the network once for both classification and localization, based on the intuition that localization can be treated as a regression problem. Some single-stage object detection algorithms are:
● YOLO
● YOLOv2
● YOLOv3
● YOLOv4
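To illustrate how localization becomes a regression problem, the sketch below encodes one ground-truth box as regression targets on an S x S grid, following the scheme of the original YOLO paper (S = 7, B = 2 boxes per cell, C = 20 classes for PASCAL VOC). The function names are hypothetical, and later YOLO versions use anchor boxes and a different parameterization, so this applies to the first version only.

```python
S = 7   # grid size used in the original YOLO paper for PASCAL VOC
B = 2   # boxes predicted per grid cell
C = 20  # number of PASCAL VOC classes


def output_depth(b=B, c=C):
    """Values predicted per cell: (x, y, w, h, confidence) per box, plus class scores."""
    return b * 5 + c


def encode_box(x_center, y_center, width, height, s=S):
    """Encode a box given in image coordinates normalized to [0, 1].

    Returns (row, col, x_offset, y_offset, width, height), where the offsets
    are the box centre relative to the top-left corner of its grid cell.
    """
    # The cell containing the box centre is responsible for the object.
    col = min(int(x_center * s), s - 1)
    row = min(int(y_center * s), s - 1)
    x_offset = x_center * s - col
    y_offset = y_center * s - row
    return row, col, x_offset, y_offset, width, height


# A box centred in the image falls in the middle cell of the 7 x 7 grid.
target = encode_box(0.5, 0.5, 0.2, 0.3)
```

With these settings the network output for one image has shape S x S x `output_depth()`, that is 7 x 7 x 30.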
Loss function of YOLO versions on the PASCAL VOC 2012 dataset
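For reference, the sum-squared-error loss defined in the original YOLO paper can be written as below. Here $S^2$ is the number of grid cells, $B$ the number of boxes per cell, $\mathbb{1}_{ij}^{\text{obj}}$ indicates that box $j$ in cell $i$ is responsible for an object, hatted symbols are predictions, and the paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$. Later YOLO versions change several of these terms, so this should be read as the first version's loss only.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines penalize localization errors, the next two penalize confidence errors for cells with and without objects, and the last line penalizes classification errors.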
Mean Average Precision (mAP) of YOLO versions on the PASCAL VOC 2012 dataset
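As a rough sketch of how mAP is obtained, the code below computes average precision (AP) for one class as the area under the precision-recall curve after making precision monotonically non-increasing, then averages AP over classes. It assumes each detection has already been marked as a true or false positive (typically by matching it to a ground-truth box with IoU of at least 0.5); the function names and input format are assumptions, not the official evaluation code.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP for one class from detection scores and true/false positive flags."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Process detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: each point takes the best precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values given as a {class_name: ap} dict."""
    return sum(per_class_ap.values()) / len(per_class_ap)


# Three detections for a class with two ground-truth objects.
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
```

For PASCAL VOC, `mean_average_precision` would be called with one AP value for each of the twenty classes.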
[1] YOLO paper
[2] YOLOv2 paper
[3] YOLOv3 paper
[4] YOLOv4 paper