First read "详解CLIP (二) | 简易使用CLIP-PyTorch预训练模型进行图像预测" to understand how CLIP works: it makes predictions by computing the distance between image embeddings and text embeddings.
Save the sample image as CLIP.png and place it in the project directory.

First install the environment dependencies that the CLIP project needs:
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0 # the pytorch and cuda versions here are only examples
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
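To verify that the installation succeeded, you can list the pretrained model names that clip.load accepts; a quick sanity check:
import clip
# prints the available pretrained models, e.g. ['RN50', ..., 'ViT-B/32', ...]
print(clip.available_models())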
To check which pytorch version is installed locally:
(torch) C:\Users\Administrator>conda list pytorch
# packages in environment at E:\ProgramFiles\anaconda\envs\torch:
#
# Name                    Version                   Build                      Channel
pytorch                   1.11.0          py3.9_cuda11.3_cudnn8_0    pytorch
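Alternatively, the version can be queried from Python directly:
import torch
print(torch.__version__)          # e.g. 1.11.0
print(torch.version.cuda)         # the CUDA version pytorch was built with, e.g. 11.3
print(torch.cuda.is_available())  # True if a usable GPU is detected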
After the environment is set up, run the following code:
import torch
import clip
from PIL import Image

# from https://zhuanlan.zhihu.com/p/524247403
# install the environment dependencies first, as described above
device = "cuda" if torch.cuda.is_available() else "cpu"
# downloads to path "~/.cache/clip" by default
model, preprocess = clip.load("ViT-B/32", device=device)

# CLIP.png is Figure 1 of this article, i.e. the CLIP pipeline diagram
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
# tokenize the three candidate captions
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

# image.shape: torch.Size([1, 3, 224, 224])
print("image.shape: ", image.shape)
# text.shape: torch.Size([3, 77])
print("text.shape: ", text.shape)

with torch.no_grad():
    # image_features = model.encode_image(image)  # encode the image
    # text_features = model.encode_text(text)     # encode the text
    logits_per_image, logits_per_text = model(image, text)

# logits_per_image.shape torch.Size([1, 3])
print("logits_per_image.shape", logits_per_image.shape)
# logits_per_text.shape torch.Size([3, 1])
print("logits_per_text.shape", logits_per_text.shape)

probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]], i.e. "CLIP.png" matches "a diagram" with probability 0.9927937
On the first run, the model weights are downloaded to "~/.cache/clip"; see the download_root parameter in the docstring of clip.load. The expected outputs are shown in the inline comments: for one image and three captions, logits_per_image has shape [1, 3], and logits_per_text is simply its transpose, hence shape [3, 1].
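If you would rather keep the weights inside the project than in the default cache, clip.load accepts a download_root argument; a minimal sketch (the ./models path is only an example):
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# store the ViT-B/32 weights under ./models instead of ~/.cache/clip
model, preprocess = clip.load("ViT-B/32", device=device, download_root="./models")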
Next, modify the code to run the following snippet:
with torch.no_grad():
    # encode the image and the text separately
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# image_features.shape torch.Size([1, 512])
# text_features.shape torch.Size([3, 512])
print("image_features.shape", image_features.shape)
print("text_features.shape", text_features.shape)
The shapes in the comments show that CLIP maps both the image and each caption to a vector of length 512, so cosine distances between them can be computed directly.
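To make that concrete, here is a minimal sketch that reuses image_features and text_features from the snippet above, L2-normalizes them, and computes the cosine similarities by hand; up to the learned temperature model.logit_scale.exp(), these are exactly the logits that model(image, text) returns:
# L2-normalize the feature vectors so that a dot product equals cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# cosine similarity between the image and each of the three captions, shape [1, 3]
similarity = image_features @ text_features.T
print("cosine similarities:", similarity)

# model(image, text) multiplies these similarities by the learned temperature
# model.logit_scale.exp() to produce logits_per_image before the softmax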