Object Detection 5: Introduction to Megvii's YOLOX Algorithm


Welcome to my personal blog 🌹🌹知行空间🌹🌹

Paper: http://arxiv.org/abs/2107.08430

Code: https://github.com/Megvii-BaseDetection/YOLOX

1. Introduction

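YOLOX is an anchor-free detection algorithm proposed in a paper submitted in July 2021 by Zheng Ge, Songtao Liu, and colleagues at Megvii. The work focuses on two things: a decoupled head and the label assignment strategy SimOTA. Using YOLOX, the authors (as team BaseDet) won first place in the Streaming Perception Challenge of the Workshop on Autonomous Driving at CVPR 2021.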

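After YOLOv1, every paper in the YOLO series was anchor-based. Meanwhile, anchor-free detectors such as CornerNet, CenterNet, and FCOS kept improving, so the YOLOX authors set out to bring anchor-free techniques back to YOLO. They consider YOLOv4 and YOLOv5 to be over-optimized for the anchor-based pipeline, so YOLOX is mainly compared against YOLOv3. The baseline used in YOLOX is YOLOv3-SPP, to which the authors add training strategies such as EMA weight updating and a cosine learning-rate schedule.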
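The two training strategies named above are easy to state in code. The following is a minimal sketch, not the YOLOX implementation: the function names `ema_update` and `cosine_lr` are hypothetical, and the decay and learning-rate values are placeholder assumptions rather than the settings used in the paper (warmup is also omitted).

```python
import math


def ema_update(ema_weights, weights, decay=0.999):
    """Exponential moving average of weights: decay * ema + (1 - decay) * w.

    `decay` is an illustrative value, not the YOLOX setting.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def cosine_lr(step, total_steps, base_lr=0.01, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at the last step."""
    progress = step / max(total_steps, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# The EMA copy moves slowly towards the current weights.
ema = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
print(ema)

# Learning rate at the start, middle and end of training.
print([cosine_lr(s, 100) for s in (0, 50, 100)])
```

The EMA copy of the weights is typically the one used for evaluation, while the optimizer keeps updating the raw weights.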

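2. Main contributions of YOLOX

2.1 Decoupling the classification and regression heads (Decoupled head)

In the R-CNN family of detectors, bounding-box regression and object classification are handled by two separate output heads. The YOLO head used throughout the YOLO series, in contrast, outputs the regression and class information together, from one shared branch and in a single vector. As shown in Figure 1, YOLOX brings back the decoupled head, and the authors show experimentally that it improves detection in two ways: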

[Figure 1: the decoupled head]

1) Faster convergence.


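2) A smaller performance drop when building an NMS-free, end-to-end YOLO: the paper reports a drop of 0.8 AP with the decoupled head, against 4.2 AP with the coupled head.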

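To make the idea of a decoupled head concrete, here is a rough NumPy sketch. It is a simplification under stated assumptions: the class name `DecoupledHeadSketch` is hypothetical, every convolution is reduced to a 1x1 (pointwise) convolution with random weights, and normalization layers are left out. What it does keep is the structure: a shared stem, then a classification branch and a separate regression branch that also carries an objectness output.

```python
import numpy as np


def conv1x1(x, weight):
    """Pointwise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def relu(x):
    return np.maximum(x, 0.0)


class DecoupledHeadSketch:
    """Shared stem followed by separate classification and regression branches."""

    def __init__(self, in_channels, hidden, num_classes, seed=0):
        rng = np.random.default_rng(seed)

        def init(out_c, in_c):
            return rng.normal(0.0, 0.01, size=(out_c, in_c))

        self.stem = init(hidden, in_channels)      # shared channel reduction
        self.cls_branch = init(hidden, hidden)     # classification features
        self.reg_branch = init(hidden, hidden)     # regression features
        self.cls_out = init(num_classes, hidden)   # per-class scores
        self.reg_out = init(4, hidden)             # box parameters
        self.obj_out = init(1, hidden)             # objectness

    def __call__(self, x):
        shared = relu(conv1x1(x, self.stem))
        cls_feat = relu(conv1x1(shared, self.cls_branch))
        reg_feat = relu(conv1x1(shared, self.reg_branch))
        return (conv1x1(cls_feat, self.cls_out),
                conv1x1(reg_feat, self.reg_out),
                conv1x1(reg_feat, self.obj_out))


head = DecoupledHeadSketch(in_channels=64, hidden=32, num_classes=80)
cls, reg, obj = head(np.zeros((64, 20, 20)))
print(cls.shape, reg.shape, obj.shape)
```

A coupled head would instead produce all `num_classes + 5` channels from one output layer on one set of features.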

2.2 SimOTA

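Label assignment is a key problem in object detection. Earlier methods mostly decide whether an anchor box is positive or negative by maximum IoU, which is a static matching rule based on prior knowledge. Because an image contains only a limited number of objects, a large share of the anchors is labelled negative, which causes a severe class imbalance. ATSS proposes adaptive training sample selection, which adaptively sets the IoU threshold that separates positive from negative anchors. Label assignment can also be framed as assigning a class to every anchor box, given the ground-truth boxes, so that the overall result is globally optimal. This fits a bipartite matching model, so quite a few works build on the Hungarian algorithm or on optimal transport (Optimal Transport Assignment, OTA). OTA was proposed by the YOLOX authors in a paper submitted in March 2021 and performs dynamic label matching.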
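As a point of comparison for the dynamic methods, the static max-IoU rule can be sketched in a few lines. This is an illustrative simplification: the helper names `pairwise_iou` and `max_iou_assign` and the 0.5 threshold are assumptions, and real assigners usually add an ignore band and force at least one match per ground truth.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IoU between every box in boxes_a (N, 4) and boxes_b (M, 4), xyxy format."""
    tl = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    br = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    inter = np.clip(br - tl, 0.0, None).prod(axis=2)
    area_a = (boxes_a[:, 2:] - boxes_a[:, :2]).prod(axis=1)
    area_b = (boxes_b[:, 2:] - boxes_b[:, :2]).prod(axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def max_iou_assign(anchors, gt_boxes, pos_thr=0.5):
    """Return the matched gt index per anchor, or -1 for a negative anchor."""
    ious = pairwise_iou(anchors, gt_boxes)
    best_gt = ious.argmax(axis=1)
    best_iou = ious.max(axis=1)
    return np.where(best_iou >= pos_thr, best_gt, -1)


anchors = np.array([[0, 0, 10, 10],
                    [0, 0, 9, 10],
                    [20, 20, 30, 30],
                    [50, 50, 60, 60]], dtype=float)
gt = np.array([[0, 0, 10, 10]], dtype=float)
print(max_iou_assign(anchors, gt))
```

The threshold is fixed in advance and does not depend on what the network currently predicts, which is what "static" means here.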

Four key points of advanced label assignment:

1) loss/quality aware
2) center prior
3) dynamic number of positive anchors
4) global view

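The YOLOX authors point out that solving the OT problem with the Sinkhorn-Knopp algorithm adds about 25% to the training time, so YOLOX uses a dynamic top-k strategy, called SimOTA, to obtain an approximate solution.

The SimOTA procedure:

1) First compute a cost for every prediction-gt pair: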

$cost_{ij}=L^{cls}_{ij} + \lambda L^{reg}_{ij}$

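2) For each ground truth $g_i$, select the top k predictions with the lowest cost within a fixed center region as positive samples; all other predictions are negative samples. The hyperparameter k takes a different value for each ground-truth box. For how it is chosen, see the authors' discussion in the OTA paper and this blog post.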
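The two steps can be tried on a toy example. The sketch below is written in NumPy and is only an approximation of the idea: `dynamic_k_select` is a hypothetical name, the center-region filter is skipped, and the cost is taken to be `1 - IoU` purely for illustration instead of the classification-plus-regression cost defined above.

```python
import numpy as np


def dynamic_k_select(cost, ious, candidate_topk=10):
    """cost, ious: (num_priors, num_gts). Returns (matching matrix, k per gt)."""
    num_priors, num_gts = cost.shape
    q = min(candidate_topk, num_priors)
    # k for each gt: sum of its q largest IoUs, truncated, and at least 1.
    topk_ious = -np.sort(-ious, axis=0)[:q]
    dynamic_ks = np.maximum(topk_ious.sum(axis=0).astype(int), 1)

    matching = np.zeros_like(cost, dtype=int)
    for g in range(num_gts):
        pos = np.argsort(cost[:, g])[:dynamic_ks[g]]  # k lowest-cost priors
        matching[pos, g] = 1

    # A prior selected by several gts keeps only its cheapest gt.
    rows = np.flatnonzero(matching.sum(axis=1) > 1)
    if rows.size > 0:
        cheapest = cost[rows].argmin(axis=1)
        matching[rows] = 0
        matching[rows, cheapest] = 1
    return matching, dynamic_ks


# 5 priors, 2 ground-truth boxes.
ious = np.array([[0.9, 0.0],
                 [0.8, 0.9],
                 [0.7, 0.6],
                 [0.1, 0.5],
                 [0.0, 0.2]])
cost = 1.0 - ious  # illustrative cost only
matching, ks = dynamic_k_select(cost, ious)
print(ks)
print(matching)
```

In this toy case both ground truths get k = 2, and prior 1 is first selected by both of them before being kept only for the second one, where its cost is lower.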

Source code analysis:

The code below follows the SimOTA implementation in mmdetection; see mmdet/core/bbox/assigners/sim_ota_assigner.py

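import torch
import torch.nn.functional as F

# Import paths as in mmdetection 2.x (assumed; adjust to the installed version)
from mmdet.core.bbox.assigners import AssignResult, BaseAssigner
from mmdet.core.bbox.iou_calculators import bbox_overlaps


class SimOTAAssigner(BaseAssigner):
    def __init__(self,):
        # Body elided in the original post. It sets center_radius,
        # candidate_topk, iou_weight and cls_weight, which are used below.
        ...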

    def _assign(self,
                pred_scores,
                priors,
                decoded_bboxes,
                gt_bboxes,
                gt_labels,
                gt_bboxes_ignore=None,
                eps=1e-7):
        """Assign gt to priors using SimOTA.
        Args:
            pred_scores (Tensor): Classification scores of one image,
                a 2D-Tensor with shape [num_priors, num_classes]
            priors (Tensor): All priors of one image, a 2D-Tensor with shape
                [num_priors, 4] in [cx, cy, stride_w, stride_h] format.
            decoded_bboxes (Tensor): Predicted bboxes, a 2D-Tensor with shape
                [num_priors, 4] in [tl_x, tl_y, br_x, br_y] format.
            gt_bboxes (Tensor): Ground truth bboxes of one image, a 2D-Tensor
                with shape [num_gts, 4] in [tl_x, tl_y, br_x, br_y] format.
            gt_labels (Tensor): Ground truth labels of one image, a Tensor
                with shape [num_gts].
            gt_bboxes_ignore (Tensor, optional): Ground truth bboxes that are
                labelled as `ignored`, e.g., crowd boxes in COCO.
            eps (float): A value added to the denominator for numerical
                stability. Default 1e-7.
        Returns:
            :obj:`AssignResult`: The assigned result.
        """
        INF = 100000.0
        num_gt = gt_bboxes.size(0)
        num_bboxes = decoded_bboxes.size(0)

        # assign 0 by default
        assigned_gt_inds = decoded_bboxes.new_full((num_bboxes, ),
                                                   0,
                                                   dtype=torch.long)
        valid_mask, is_in_boxes_and_center = self.get_in_gt_and_in_center_info(
            priors, gt_bboxes)
        valid_decoded_bbox = decoded_bboxes[valid_mask]
        valid_pred_scores = pred_scores[valid_mask]
        num_valid = valid_decoded_bbox.size(0)

        if num_gt == 0 or num_bboxes == 0 or num_valid == 0:
            # No ground truth or boxes, return empty assignment
            max_overlaps = decoded_bboxes.new_zeros((num_bboxes, ))
            if num_gt == 0:
                # No truth, assign everything to background
                assigned_gt_inds[:] = 0
            if gt_labels is None:
                assigned_labels = None
            else:
                assigned_labels = decoded_bboxes.new_full((num_bboxes, ),
                                                          -1,
                                                          dtype=torch.long)
            return AssignResult(
                num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

        pairwise_ious = bbox_overlaps(valid_decoded_bbox, gt_bboxes)
        iou_cost = -torch.log(pairwise_ious + eps)

        gt_onehot_label = (
            F.one_hot(gt_labels.to(torch.int64),
                      pred_scores.shape[-1]).float().unsqueeze(0).repeat(
                          num_valid, 1, 1))

        valid_pred_scores = valid_pred_scores.unsqueeze(1).repeat(1, num_gt, 1)
        cls_cost = (
            F.binary_cross_entropy(
                valid_pred_scores.to(dtype=torch.float32).sqrt_(),
                gt_onehot_label,
                reduction='none',
            ).sum(-1).to(dtype=valid_pred_scores.dtype))

        cost_matrix = (
            cls_cost * self.cls_weight + iou_cost * self.iou_weight +
            (~is_in_boxes_and_center) * INF)

        matched_pred_ious, matched_gt_inds = \
            self.dynamic_k_matching(
                cost_matrix, pairwise_ious, num_gt, valid_mask)

        # convert to AssignResult format
        assigned_gt_inds[valid_mask] = matched_gt_inds + 1
        assigned_labels = assigned_gt_inds.new_full((num_bboxes, ), -1)
        assigned_labels[valid_mask] = gt_labels[matched_gt_inds].long()
        max_overlaps = assigned_gt_inds.new_full((num_bboxes, ),
                                                 -INF,
                                                 dtype=torch.float32)
        max_overlaps[valid_mask] = matched_pred_ious
        return AssignResult(
            num_gt, assigned_gt_inds, max_overlaps, labels=assigned_labels)

    def get_in_gt_and_in_center_info(self, priors, gt_bboxes):
        num_gt = gt_bboxes.size(0)

        repeated_x = priors[:, 0].unsqueeze(1).repeat(1, num_gt)
        repeated_y = priors[:, 1].unsqueeze(1).repeat(1, num_gt)
        repeated_stride_x = priors[:, 2].unsqueeze(1).repeat(1, num_gt)
        repeated_stride_y = priors[:, 3].unsqueeze(1).repeat(1, num_gt)

        # is prior centers in gt bboxes, shape: [n_prior, n_gt]
        l_ = repeated_x - gt_bboxes[:, 0]
        t_ = repeated_y - gt_bboxes[:, 1]
        r_ = gt_bboxes[:, 2] - repeated_x
        b_ = gt_bboxes[:, 3] - repeated_y

        deltas = torch.stack([l_, t_, r_, b_], dim=1)
        is_in_gts = deltas.min(dim=1).values > 0
        is_in_gts_all = is_in_gts.sum(dim=1) > 0

        # is prior centers in gt centers
        gt_cxs = (gt_bboxes[:, 0] + gt_bboxes[:, 2]) / 2.0
        gt_cys = (gt_bboxes[:, 1] + gt_bboxes[:, 3]) / 2.0
        ct_box_l = gt_cxs - self.center_radius * repeated_stride_x
        ct_box_t = gt_cys - self.center_radius * repeated_stride_y
        ct_box_r = gt_cxs + self.center_radius * repeated_stride_x
        ct_box_b = gt_cys + self.center_radius * repeated_stride_y

        cl_ = repeated_x - ct_box_l
        ct_ = repeated_y - ct_box_t
        cr_ = ct_box_r - repeated_x
        cb_ = ct_box_b - repeated_y

        ct_deltas = torch.stack([cl_, ct_, cr_, cb_], dim=1)
        is_in_cts = ct_deltas.min(dim=1).values > 0
        is_in_cts_all = is_in_cts.sum(dim=1) > 0

        # in boxes or in centers, shape: [num_priors]
        is_in_gts_or_centers = is_in_gts_all | is_in_cts_all

        # both in boxes and centers, shape: [num_fg, num_gt]
        is_in_boxes_and_centers = (
            is_in_gts[is_in_gts_or_centers, :]
            & is_in_cts[is_in_gts_or_centers, :])
        return is_in_gts_or_centers, is_in_boxes_and_centers

    def dynamic_k_matching(self, cost, pairwise_ious, num_gt, valid_mask):
        matching_matrix = torch.zeros_like(cost, dtype=torch.uint8)
        # select candidate topk ious for dynamic-k calculation
        candidate_topk = min(self.candidate_topk, pairwise_ious.size(0))
        # topk_ious shape: candidate_topk x num_gts
        topk_ious, _ = torch.topk(pairwise_ious, candidate_topk, dim=0)
        # calculate dynamic k for each gt
        # dynamic_ks shape: 1 x num_gts
        # dynamic k is the truncated sum of the top IoUs of each gt (at most
        # `candidate_topk` of them), so it differs from one gt to another
        dynamic_ks = torch.clamp(topk_ious.sum(0).int(), min=1)
        for gt_idx in range(num_gt):
            _, pos_idx = torch.topk(
                cost[:, gt_idx], k=dynamic_ks[gt_idx], largest=False)
            matching_matrix[:, gt_idx][pos_idx] = 1

        # The `matching_matrix` built above may assign one prior to several
        # gt boxes. For such a prior, keep only the gt box with the lowest
        # cost.
        del topk_ious, dynamic_ks, pos_idx
        # prior_match_gt_mask shape like (num_priors,)
        prior_match_gt_mask = matching_matrix.sum(1) > 1
        if prior_match_gt_mask.sum() > 0:
            cost_min, cost_argmin = torch.min(
                cost[prior_match_gt_mask, :], dim=1)
            matching_matrix[prior_match_gt_mask, :] *= 0
            matching_matrix[prior_match_gt_mask, cost_argmin] = 1
        # get foreground mask inside box and center prior
        fg_mask_inboxes = matching_matrix.sum(1) > 0
        valid_mask[valid_mask.clone()] = fg_mask_inboxes

        matched_gt_inds = matching_matrix[fg_mask_inboxes, :].argmax(1)
        matched_pred_ious = (matching_matrix *
                             pairwise_ious).sum(1)[fg_mask_inboxes]
        return matched_pred_ious, matched_gt_inds

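2.3 End-to-end YOLO (NMS-free)

In December 2020, Jianfeng Wang et al. from Megvii submitted the paper End-to-End Object Detection with Fully Convolutional Network, which proposes a detector that does not need non-maximum suppression (NMS) as post-processing. It changes the label assignment, replacing the usual one-to-many assignment with a one-to-one assignment; for more details, see the article the paper's authors posted on Zhihu. In January 2021, Qiang Zhou et al. from Alibaba submitted [Object Detection Made Simpler by Eliminating Heuristic NMS](https://arxiv.org/abs/2101.11782), which also uses one-to-one label assignment to build an NMS-free detector. Following these methods, YOLOX implements end-to-end training and obtains the results shown in grey in the table below. Both inference speed and accuracy are affected to some degree: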

[Table: YOLOX experimental results, with the NMS-free results shown in grey]

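I have been focusing on label assignment recently, so I have not yet read the NMS-free source code and will come back to it later. NMS-free detection is valuable in practice: when a model is deployed on an edge device, the NMS post-processing cannot be accelerated on the NPU the way the model itself can, so it has to run on the CPU.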
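For reference, this is the kind of post-processing an NMS-free detector removes. The sketch below is a plain greedy NMS in NumPy; the function name `nms` and the 0.5 IoU threshold are illustrative choices, not taken from YOLOX.

```python
import numpy as np


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS. boxes: (N, 4) in xyxy format, scores: (N,). Returns kept indices."""
    order = np.argsort(-scores)  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU between the kept box and every remaining box.
        tl = np.maximum(boxes[i, :2], boxes[rest, :2])
        br = np.minimum(boxes[i, 2:], boxes[rest, 2:])
        inter = np.clip(br - tl, 0.0, None).prod(axis=1)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop the boxes that overlap the kept box too much.
        order = rest[iou <= iou_thr]
    return keep


boxes = np.array([[0, 0, 10, 10],
                  [1, 1, 11, 11],
                  [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))
```

Each iteration depends on which boxes survived the previous one, so the loop is sequential and data-dependent, unlike the fixed convolution stack of the model itself.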

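2.4 Other

Data augmentation:
- Mosaic: introduced by the YOLOv5 author in ultralytics-YOLOv3.
- MixUp: proposed in the 2018 paper mixup: Beyond Empirical Risk Minimization. It was first used for image classification and, since BoF (Bag of Freebies), has also been used as data augmentation for object detection.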

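A minimal sketch of MixUp adapted to detection, assuming two images of the same size: the pixels are blended and the boxes of both images are kept. The function name `mixup_detection` and the fixed blend ratio `lam=0.5` are assumptions for illustration; the original mixup paper samples the ratio from a Beta distribution, and real pipelines also resize and jitter the second image.

```python
import numpy as np


def mixup_detection(img_a, boxes_a, img_b, boxes_b, lam=0.5):
    """Blend two same-sized images and keep the boxes of both."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same shape")
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    boxes = np.concatenate([boxes_a, boxes_b], axis=0)
    return mixed, boxes


img_a = np.zeros((4, 4, 3), dtype=np.uint8)
img_b = np.full((4, 4, 3), 100, dtype=np.uint8)
boxes_a = np.array([[0, 0, 2, 2]], dtype=np.float32)
boxes_b = np.array([[1, 1, 3, 3], [2, 2, 4, 4]], dtype=np.float32)
mixed, boxes = mixup_detection(img_a, boxes_a, img_b, boxes_b)
print(mixed.shape, boxes.shape)
```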

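Anchor-free: the approach is similar to FCOS. Each feature-map cell predicts only one box, which is equivalent to setting the number of anchors to 1. To obtain more positive samples, all feature-map cells that fall within a certain range around the object center are treated as positive samples, i.e. multiple positive samples.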
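With one prediction per cell, turning the raw outputs into boxes only needs the cell grid and the stride. The sketch below reflects my understanding of the decoding used in YOLOX (offset from the cell's grid position for the center, exponential for the size, both scaled by the stride); the function name `decode_outputs` and the tensor layout are assumptions for illustration.

```python
import numpy as np


def decode_outputs(raw, stride):
    """raw: (H, W, 4) with (tx, ty, tw, th) per cell.

    Returns (H*W, 4) boxes as (cx, cy, w, h) in input-image pixels.
    """
    h, w, _ = raw.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)  # (H, W, 2) as (x, y)
    centers = (raw[..., :2] + grid) * stride   # offset from the cell position
    sizes = np.exp(raw[..., 2:]) * stride      # log-space width and height
    return np.concatenate([centers, sizes], axis=-1).reshape(-1, 4)


raw = np.zeros((2, 3, 4), dtype=np.float32)  # a 2x3 feature map, all-zero outputs
decoded = decode_outputs(raw, stride=8)
print(decoded.shape)
print(decoded[5])  # cell at row 1, column 2
```

No anchor widths or heights appear anywhere: the stride of the feature level is the only scale prior.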


References

1. https://zhuanlan.zhihu.com/p/392221567
2. https://zhuanlan.zhihu.com/p/394392992
3. https://zhuanlan.zhihu.com/p/397993315
4. 丢弃Transformer，FCN也可以实现E2E检测 (Drop the Transformer: an FCN can also do end-to-end detection)


Original post: https://blog.csdn.net/lx_ros/article/details/127273237