Background
Motivation
How can large pre-trained Transformers be adapted to downstream tasks most effectively and efficiently?
Contribution

Method
Method overview diagram

Method preliminaries


Visual-Prompt Tuning (VPT)
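The core mechanism can be sketched as follows: VPT-Shallow prepends p learnable prompt tokens to the patch-embedding sequence before the first Transformer layer, while VPT-Deep inserts a fresh set of prompts at every layer. A minimal NumPy sketch of the shape bookkeeping (tensor names and the batch/patch dimensions here are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Hypothetical shapes: batch B, n patch tokens, embed dim d, p prompts.
B, n, d, p = 2, 196, 768, 50

rng = np.random.default_rng(0)
cls_tok = rng.normal(size=(B, 1, d))   # [CLS] token embedding
patches = rng.normal(size=(B, n, d))   # patch embeddings (frozen backbone)
prompts = rng.normal(size=(p, d))      # learnable prompts (the only new params)

# VPT-Shallow: insert prompts between [CLS] and the patch tokens
# before the first Transformer layer.
prompts_b = np.broadcast_to(prompts, (B, p, d))
x = np.concatenate([cls_tok, prompts_b, patches], axis=1)
print(x.shape)  # (2, 247, 768): 1 [CLS] + 50 prompts + 196 patches
```

During training, only `prompts` (and the classification head) receive gradients; the backbone stays frozen.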

Storing Visual Prompts
VPT is beneficial in the presence of multiple downstream tasks. We only need to store the learned prompts and classification head for each task and re-use the original copy of the pre-trained Transformer model, significantly reducing the storage cost. For instance, given a ViT-Base with 86 million (M) parameters and d = 768, 50 shallow prompts and deep prompts yield an additional p × d = 50 × 768 ≈ 0.038M and N × p × d ≈ 0.46M parameters, respectively, amounting to only 0.04% and 0.53% of all ViT-Base parameters.
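The arithmetic above can be checked directly. This is a sketch: N = 12 is the ViT-Base layer count implied by the 0.46M figure, and the exact percentages shift slightly depending on the precise parameter total used for "86M":

```python
d = 768           # ViT-Base embedding dimension
p = 50            # number of prompt tokens
N = 12            # Transformer layers in ViT-Base
vit_base = 86e6   # approximate total ViT-Base parameters

shallow = p * d       # VPT-Shallow: prompts at the input layer only
deep = N * p * d      # VPT-Deep: one prompt set per layer

print(f"shallow: {shallow / 1e6:.3f}M")   # 0.038M
print(f"deep:    {deep / 1e6:.2f}M")      # 0.46M
print(f"fractions of backbone: {shallow / vit_base:.2%}, {deep / vit_base:.2%}")
```

Per task, only these prompt parameters plus the classification head are stored; the 86M backbone is shared across all tasks.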
Experiment result
Experimental setup (points of particular interest):
Experimental results

Even if storage is not a concern, VPT is a promising approach for adapting larger Transformers in vision. VPT-Deep outperforms all other parameter-efficient tuning protocols across all task groups, making it the best fine-tuning strategy in storage-constrained environments. Although suboptimal compared to VPT-Deep, VPT-Shallow still offers non-trivial performance gains over head-oriented tuning methods, making it a worthwhile choice for deploying multi-task fine-tuned models when the storage constraint is severe.


The experiments are conducted on the ImageNet-21k supervised pre-trained Swin-Base. VPT continues to outperform other parameter-efficient fine-tuning methods (b, c) for all three subgroups of VTAB, though in this case Full yields the highest accuracy scores overall (at a heavy cost in total parameters).


Yet the accuracy drops if we insert prompts from top to bottom, suggesting that prompts at earlier Transformer layers matter more than those at later layers.


In the case of MoCo v3, VPT no longer holds the best performance, though it is still competitive with the other methods. This suggests that these self-supervised ViTs are fundamentally different from the supervised ones in the previous sections. Exactly why and how these differences arise remains an open question.

We examine the idea of adding trainable parameters in the input space of ConvNets: padding both the height and width of the input image with p learnable prompt pixels. Though this operation seems unconventional, we implement VPT this way because there is no obvious way to add location-invariant prompts analogous to the Transformer counterparts; in fact, this approach has been explored before in the adversarial attack literature. VPT works well with a larger ConvNet backbone, ConvNeXt-B, offering accuracy gains over other sparse tuning protocols (b, c) and outperforming Full in 8 out of 19 cases. The advantages of VPT, however, diminish with a smaller ConvNet (ResNet-50), where there is no clear winner across the 19 VTAB-1k tasks.
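This pixel-level prompting can be sketched as a learnable border canvas around the image. The sizes below are hypothetical (the prompt width p is a tuned hyperparameter, not a value from the notes):

```python
import numpy as np

p = 30        # prompt-pixel border width (illustrative assumption)
C, H, W = 3, 224, 224

rng = np.random.default_rng(0)
image = rng.normal(size=(C, H, W))

# Learnable canvas: only the border region acts as the prompt,
# since the center is overwritten by the input image.
prompt_canvas = rng.normal(size=(C, H + 2 * p, W + 2 * p))  # trainable

x = prompt_canvas.copy()
x[:, p:p + H, p:p + W] = image   # place the original image in the center
print(x.shape)  # (3, 284, 284)
```

As with the Transformer variant, only the border parameters (and the head) would be updated during fine-tuning; the ConvNet backbone stays frozen.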
Conclusion