Metrics for object detection
Why these metrics are needed
Because object detection is not a binary task, it is hard to judge directly whether a model is good or bad, so dedicated evaluation standards are needed. These standards were established in the different detection competitions, for example:
- PASCAL VOC Challenge offers a Matlab script to evaluate the quality of the detected objects. Participants of the competition can use the provided Matlab script to measure the accuracy of their detections before submitting their results. The official documentation explaining their criteria for object detection metrics can be accessed here. The metrics used by the current PASCAL VOC object detection challenge are the Precision x Recall curve and Average Precision.
The PASCAL VOC Matlab evaluation code reads the ground truth bounding boxes from XML files, requiring changes in the code if you want to apply it to other datasets or to your specific cases. Even though projects such as Faster-RCNN implement PASCAL VOC evaluation metrics, it is also necessary to convert the detected bounding boxes into their specific format. The Tensorflow framework also has its own PASCAL VOC metrics implementation.
- COCO Detection Challenge uses different metrics to evaluate the accuracy of object detection by different algorithms. Here you can find documentation explaining the 12 metrics used for characterizing the performance of an object detector on COCO. This competition offers Python and Matlab code so users can verify their scores before submitting the results. It is also necessary to convert the results to a format required by the competition.
- Google Open Images Dataset V4 Competition also uses mean Average Precision (mAP) over the 500 classes to evaluate the object detection task.
- ImageNet Object Localization Challenge defines an error for each image considering the class and the overlapping region between ground truth and detected boxes. The total error is computed as the average of all min errors among all test dataset images. Here are more details about their evaluation method.
Some basic definitions related to mAP
mAP: generally understood as the area under the PR curve (the AP), averaged over all classes.
PR curve: the Precision-Recall curve, with precision on the vertical axis and recall on the horizontal axis; as recall increases, precision generally decreases.
A worked example
An example helps us better understand the concept of interpolated average precision. Consider the detections below:
There are 7 images with 15 ground truth objects represented by
the green bounding boxes and 24 detected objects represented by the red
bounding boxes. Each detected object has a confidence level and is identified by a letter (A,B,…,Y).
The following table shows the bounding boxes with their corresponding confidences. The last column identifies each detection as TP or FP. In this example a detection is considered a TP if its IOU $\geq$ 30%; otherwise it is an FP. By looking at the images above we can roughly tell whether the detections are TP or FP.
In some images, more than one detection overlaps a ground truth (Images 2, 3, 4, 5, 6 and 7). For those cases the detection with the highest IOU is considered a TP and the others are considered FP. This rule is applied by the PASCAL VOC 2012 metric: “e.g. 5 detections (TP) of a single object is counted as 1 correct detection and 4 false detections”.
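To make the matching rule concrete, here is a minimal Python sketch (not the official PASCAL VOC code) of a greedy TP/FP assignment for a single image, assuming boxes are given as (x1, y1, x2, y2) tuples and detections are processed in descending confidence order; the helper names `iou` and `mark_tp_fp` are illustrative only.

```python
# Sketch: greedy TP/FP assignment for one image. Boxes are assumed to be
# (x1, y1, x2, y2) tuples; detections are assumed sorted by descending confidence.

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mark_tp_fp(detections, ground_truths, iou_threshold=0.3):
    """Return one True (TP) / False (FP) flag per detection.

    Each ground truth can be matched at most once, so additional detections
    of an already-matched object are counted as FP, following the rule above.
    """
    matched = [False] * len(ground_truths)
    flags = []
    for det in detections:
        ious = [iou(det, gt) for gt in ground_truths]
        best = max(range(len(ious)), key=ious.__getitem__) if ious else -1
        if best >= 0 and ious[best] >= iou_threshold and not matched[best]:
            matched[best] = True
            flags.append(True)   # TP
        else:
            flags.append(False)  # FP
    return flags
```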
The Precision x Recall curve is plotted by calculating the precision and recall values of the accumulated TP or FP detections. For this, first we need to order the detections by their confidences, then we calculate the precision and recall for each accumulated detection as shown in the table below:
$$precision = \frac{TP}{TP+FP}$$
$$recall = \frac{TP}{TP+FN}$$
Here $FN = 15 - TP$ (the total number of ground-truth objects minus the number of ground-truth objects detected so far).
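As a sketch of the accumulation step (assuming we already have one `(confidence, is_tp)` pair per detection; `precision_recall_curve` is a hypothetical helper, and `total_gt` plays the role of the 15 ground-truth objects):

```python
# Sketch: cumulative precision/recall over detections sorted by confidence.
# `detections` is assumed to be a list of (confidence, is_tp) pairs and
# `total_gt` the number of ground-truth objects (15 in this example).

def precision_recall_curve(detections, total_gt):
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp_acc = fp_acc = 0
    precisions, recalls = [], []
    for _, is_tp in detections:
        if is_tp:
            tp_acc += 1
        else:
            fp_acc += 1
        precisions.append(tp_acc / (tp_acc + fp_acc))  # TP / (TP + FP)
        recalls.append(tp_acc / total_gt)              # TP / (TP + FN), FN = total_gt - TP
    return recalls, precisions
```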
Plotting the precision and recall values we have the
following Precision x Recall curve:
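Since the plot itself is not reproduced here, the following sketch shows how such a curve could be drawn with matplotlib, reusing the hypothetical `precision_recall_curve` helper and `detections` list from the sketch above:

```python
# Sketch: plot the Precision x Recall curve from the cumulative values.
import matplotlib.pyplot as plt

recalls, precisions = precision_recall_curve(detections, total_gt=15)
plt.plot(recalls, precisions, marker='o')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision x Recall curve')
plt.show()
```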
As mentioned before, there are two different ways to
measure the interpolated average precision: 11-point interpolation and interpolating all points.
Below we make a comparison between them:
Calculating the 11-point interpolation
The idea of the 11-point interpolated average precision is to average the precisions at a set of 11 recall levels (0, 0.1, …, 1). The interpolated precision at each level is obtained by taking the maximum precision whose recall value is greater than or equal to that recall level, as follows:
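Formally, the interpolated precision at a recall level $r$ is the maximum precision measured at any recall $\tilde{r} \geq r$:

$$ p_{\text{interp}}(r) = \max_{\tilde{r} \geq r} p(\tilde{r}) $$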
By applying the 11-point interpolation, we have:
$$ AP = \frac{1}{11} \sum_{r\in\{0,\ 0.1,\ \dots,\ 0.9,\ 1\}} p_{\text{interp}}(r) $$
$$ AP = \frac{1}{11}(1+0.6666+0.4285+0.4285+0.4285+0+0+0+0+0+0)$$
$$ AP = 26.84\% $$
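A sketch of the same 11-point calculation, reusing the cumulative recall/precision lists from the earlier sketch (the function name `eleven_point_ap` is illustrative, not part of any official toolkit):

```python
# Sketch: 11-point interpolated AP from cumulative recall/precision lists.

def eleven_point_ap(recalls, precisions):
    ap = 0.0
    for r in [i / 10 for i in range(11)]:  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: maximum precision at any recall >= r
        # (0 if no point reaches that recall level).
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```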
Calculating the interpolation performed at all points
By interpolating all points, the Average Precision (AP) can be interpreted as an approximated AUC of the Precision x Recall curve. The intention is to reduce the impact of the wiggles in the curve. By applying the equations presented before, we can obtain the areas as it will be demonstrated here. We could also visually have the interpolated precision points by looking at the recalls starting from the highest (0.4666) to 0 (looking at the plot from right to left) and, as we decrease the recall, we collect the precision
values that are the highest as shown in the image below:
Looking at the plot above, we can divide the AUC into 4 areas (A1, A2, A3 and A4):
Calculating the total area, we have the AP:
$$ AP = A1+A2+A3+A4 $$
with
$$A1 = (0.0666-0) \times 1 = 0.0666$$
$$A2 = (0.1333-0.0666)\times 0.6666=0.04446222$$
$$A3 = (0.4-0.1333)\times 0.4285 = 0.11428095$$
$$A4 = (0.4666-0.4)\times0.3043=0.02026638$$
$$AP = 0.0666+0.04446222+0.11428095+0.02026638$$
$$AP = 24.56\%$$
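The area computation above can be sketched in the same style; this is an illustration of the every-point interpolation, not the official PASCAL VOC script:

```python
# Sketch: every-point interpolated AP, i.e. the approximated AUC of the
# Precision x Recall curve, from cumulative recall/precision lists.

def all_points_ap(recalls, precisions):
    # Add boundary points at recall 0 and 1.
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left: each point takes the
    # maximum precision at any recall greater than or equal to its own.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum the rectangular areas between consecutive recall values.
    ap = 0.0
    for i in range(1, len(mrec)):
        ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap
```

With the recall and precision values of this example, the sketch should reproduce the 24.56% obtained above.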
The results of the two interpolation methods are slightly different: 24.56% for the every-point interpolation and 26.84% for the 11-point interpolation.