non_max_suppression code analysis
First, a simple NMS that screens boxes by confidence alone:
```python
import numpy as np

def non_max_suppression(boxes, conf_thres=0.5, nms_thres=0.3):
    detection = boxes
    # 1. Keep only the boxes whose score exceeds the threshold. Filtering by
    #    score first greatly reduces the number of overlapping boxes that have
    #    to be compared afterwards.
    mask = detection[:, 4] >= conf_thres
    detection = detection[mask]
    if not np.shape(detection)[0]:
        return []

    best_box = []
    scores = detection[:, 4]
    # 2. Sort the boxes by score in descending order.
    arg_sort = np.argsort(scores)[::-1]
    detection = detection[arg_sort]

    while np.shape(detection)[0] > 0:
        # 3. Take the box with the highest score, compute its IoU with all
        #    remaining boxes, and discard those that overlap it too much.
        best_box.append(detection[0])
        if len(detection) == 1:
            break
        ious = iou(best_box[-1], detection[1:])
        detection = detection[1:][ious < nms_thres]
    return np.array(best_box)
```
 Filter by confidence against the conf_thres threshold. detection[0:4] holds the box coordinates and detection[4] the confidence.
 After keeping the boxes whose confidence exceeds the threshold, sort them by confidence (score) in descending order to get detection = detection[arg_sort].
 Take the highest-scoring box detection[0], compute its IoU with the remaining boxes detection[1:], and eliminate the boxes whose IoU exceeds the nms_thres threshold. The boxes whose IoU is below nms_thres are kept for the next iteration.
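The `iou` helper called in the loop above is not shown in the source; a minimal sketch of what it presumably does, assuming boxes in (x1, y1, x2, y2) corner form:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) form."""
    # Intersection rectangle (clipped to zero when the boxes do not overlap).
    inter_x1 = np.maximum(box[0], boxes[:, 0])
    inter_y1 = np.maximum(box[1], boxes[:, 1])
    inter_x2 = np.minimum(box[2], boxes[:, 2])
    inter_y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(inter_x2 - inter_x1, 0, None) * np.clip(inter_y2 - inter_y1, 0, None)
    # Union = area1 + area2 - intersection.
    area1 = (box[2] - box[0]) * (box[3] - box[1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area1 + area2 - inter + 1e-9)
```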
Multi-class NMS
```python
import torch
from torchvision.ops import boxes

def non_max_suppression(prediction, num_classes, input_shape, image_shape,
                        letterbox_image, conf_thres=0.5, nms_thres=0.4):
    # prediction[0:4] is the box coordinates, prediction[4] is the confidence,
    # and prediction[5:5 + num_classes] is the score of each class.
    output = [None for _ in range(len(prediction))]

    # Loop over the input images; usually there is only one.
    for i, image_pred in enumerate(prediction):
        # torch.max returns the largest value in each row and its index
        # (the column index of that row's maximum).
        class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)

        # First-round screening using the confidence.
        conf_mask = (image_pred[:, 4] * class_conf[:, 0] >= conf_thres).squeeze()
        if not image_pred.size(0):
            continue

        # detections: [num_anchors, 7]
        # The 7 columns are: x1, y1, x2, y2, obj_conf, class_conf, class_pred
        detections = torch.cat((image_pred[:, :5], class_conf, class_pred.float()), 1)
        detections = detections[conf_mask]

        nms_out_index = boxes.batched_nms(
            detections[:, :4],
            detections[:, 4] * detections[:, 5],
            detections[:, 6],
            nms_thres,
        )
        output[i] = detections[nms_out_index]
    return output
```
 prediction[0:4] is the box coordinates, prediction[4] is the confidence, and prediction[5:5 + num_classes] is the score of each class.
`torch.max(a, 1)` returns the largest element in each row and its index (the column index of that row's maximum).

class_conf is the per-row maximum returned by torch.max, i.e. the highest class score; class_pred is the index of that maximum, i.e. the class.
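As an illustration, the same row-wise max/argmax pair can be reproduced in NumPy (`torch.max(a, 1)` returns both pieces at once):

```python
import numpy as np

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])

# torch.max(scores, 1) would return (values, indices); in NumPy the two
# pieces come from separate calls.
class_conf = scores.max(axis=1)      # highest class score per row
class_pred = scores.argmax(axis=1)   # index of that score, i.e. the class
```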

detections[0:4] is the box coordinate, detections[4] is the confidence, detections[5] is the classification score, and detections[6] is the category index.
 Get the highest class score of each box and the corresponding class.
 Compute the classification confidence of each box as confidence * class score, and filter it against the conf_thres threshold.
 Run NMS via torchvision.ops.boxes.batched_nms.
torchvision.ops.batched_nms()
batched_nms performs NMS per category: IoU computation and threshold filtering happen only between boxes of the same category.
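Per-class NMS is commonly implemented by the coordinate-offset trick: shift each class's boxes into a disjoint region of the plane so boxes of different classes can never overlap, then run plain NMS once. A minimal NumPy sketch of that idea (illustrative only, not torchvision's actual code):

```python
import numpy as np

def batched_nms_sketch(box_xyxy, scores, idxs, iou_threshold):
    """Per-class NMS via the coordinate-offset trick (illustrative only)."""
    # Shift each class's boxes into its own disjoint region of the plane,
    # so boxes of different classes can never overlap.
    max_coord = box_xyxy.max()
    offsets = idxs[:, None] * (max_coord + 1.0)
    shifted = box_xyxy + offsets

    keep = []
    order = np.argsort(scores)[::-1]  # highest score first
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of box i with the remaining boxes, in shifted coordinates.
        x1 = np.maximum(shifted[i, 0], shifted[rest, 0])
        y1 = np.maximum(shifted[i, 1], shifted[rest, 1])
        x2 = np.minimum(shifted[i, 2], shifted[rest, 2])
        y2 = np.minimum(shifted[i, 3], shifted[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        areas = (shifted[:, 2] - shifted[:, 0]) * (shifted[:, 3] - shifted[:, 1])
        union = areas[i] + areas[rest] - inter
        order = rest[inter / union < iou_threshold]
    return np.array(keep)
```

Two identical boxes of different classes both survive, because the offset keeps their IoU at zero.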
Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| boxes | Tensor | predicted boxes |
| scores | Tensor | prediction confidences |
| idxs | Tensor | predicted box categories |
| iou_threshold | float | IoU threshold |

Return:

| Parameter | Type | Description |
| --- | --- | --- |
| keep | Tensor | indices of the boxes kept after NMS, in descending score order |
boxes must be in (x1, y1, x2, y2) format, where 0 <= x1 < x2 and 0 <= y1 < y2, i.e. the top-left and bottom-right corners.
Since the network outputs boxes as a center point plus width and height, they must first be converted to this top-left/bottom-right corner form.
```python
# Convert (cx, cy, w, h) to (x1, y1, x2, y2).
box_corner = prediction.new(prediction.shape)
box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
prediction[:, :, :4] = box_corner[:, :, :4]
```
Box coordinate formats involved in NMS
Generally, the coordinates output by the network are in center-point plus width/height form.
For example:

```python
outputs[..., :2] = (outputs[..., :2] + grids) * strides
outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides
# normalization
outputs[..., [0, 2]] = outputs[..., [0, 2]] / input_shape[1]
outputs[..., [1, 3]] = outputs[..., [1, 3]] / input_shape[0]
```

```python
pred_boxes[..., 0] = x.data + grid_x
pred_boxes[..., 1] = y.data + grid_y
pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
pred_boxes[..., 3] = torch.exp(h.data) * anchor_h
```
The network outputs boxes in center/width-height form, while the XML files annotated with labelImg store top-left and bottom-right corners. Therefore, during training, the box coordinates must be converted when building the targets, usually in the get_target method:
```python
# in_h, in_w are the required height and width of the input image.
batch_target[:, [0, 2]] = targets[b][:, [0, 2]] * in_w
batch_target[:, [1, 3]] = targets[b][:, [1, 3]] * in_h
batch_target[:, 4] = targets[b][:, 4]
```
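The conversion from labelImg's corner annotations to the normalized center/width-height targets used above can be sketched as follows (the helper name is illustrative, not from the source):

```python
import numpy as np

def corners_to_center_norm(boxes_xyxy, img_w, img_h):
    """(x1, y1, x2, y2) in pixels -> normalized (cx, cy, w, h)."""
    b = boxes_xyxy.astype(np.float64)
    cx = (b[:, 0] + b[:, 2]) / 2 / img_w   # box center, normalized by width
    cy = (b[:, 1] + b[:, 3]) / 2 / img_h   # box center, normalized by height
    w = (b[:, 2] - b[:, 0]) / img_w        # box width, normalized
    h = (b[:, 3] - b[:, 1]) / img_h        # box height, normalized
    return np.stack([cx, cy, w, h], axis=1)
```

Multiplying these normalized targets by in_w/in_h, as in the snippet above, then recovers coordinates at the network input scale.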
Why do it this way instead of using the top-left and bottom-right coordinates directly?
Personally, I think it is because the center-point plus width/height form carries semantic information, which makes training less error-prone for the network.