In February 2023, Runway — one of the developers behind the original version of Stable Diffusion — introduced Gen-1, the first AI video editing model. Gen-1 edits an existing video into the video you want: whether the input is a rough 3D animation or shaky phone footage, Gen-1 can upgrade it into a remarkably polished result (the underlying reason being that Gen-1 is trained jointly on images and videos).
As shown in the figure below, a latent video diffusion model can take the original input shown in the middle of the figure and generate a video guided either by text, as in the top-left of the figure, or by an image, as in the bottom-left.
How is this achieved?
First, text-guided video generation builds directly on the earlier line of work on text-guided image generation (text-conditioned models such as DALL-E 2 and Stable Diffusion enable novice users to generate detailed imagery given only a text prompt as input). After all, latent diffusion models provide an efficient way to synthesize images in a perceptually compressed space.
Second, the latent diffusion model is extended to video generation by introducing temporal layers into a pre-trained image model and training jointly on images and videos, i.e., on a large-scale dataset of uncaptioned videos together with paired text-image data.
Gen-1 proposes a controllable, structure- and content-aware video diffusion model. At inference time, the video can be modified under the guidance of an example image or a text prompt; in other words, editing is performed entirely at inference time, without additional per-video training or pre-processing.
Structure is represented with monocular depth estimates, and content with embeddings predicted by a pre-trained neural network (monocular depth estimation is a computer-vision technique that infers the 3D depth of a scene from a single 2D image captured by one camera).
Several control modes are then provided during video generation:
First, analogous to image synthesis models, the model is trained so that the inferred content of the video — its appearance or style — matches a user-provided image or text prompt.
Second, inspired by the diffusion process itself, an information obscuring process is applied to the structure representation, which enables selecting how strongly the model adheres to the given structure.
Finally, the inference procedure is adjusted with a custom guidance method, inspired by classifier-free guidance, to enable control over the temporal consistency of the generated clips — effectively aligning time, content, and structure.
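To make the second control mode concrete, here is a minimal sketch (my own illustration under assumptions, not the paper's code) of obscuring a structure representation — e.g. a batch of depth maps — with a variable amount of blur, where the hypothetical `obscure_level` knob controls how much structural detail survives:

```python
import torch
import torchvision.transforms.functional as TF

def obscure_structure(depth_maps: torch.Tensor, obscure_level: int) -> torch.Tensor:
    """Blur depth maps repeatedly so the model can choose how strictly
    to follow the input structure (obscure_level = 0 keeps full structure)."""
    out = depth_maps                     # (B, 1, H, W)
    for _ in range(obscure_level):
        # Each pass removes a bit more fine-grained structural detail.
        out = TF.gaussian_blur(out, kernel_size=9)
    return out
```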
To this end, a generative model of the video x is learned conditioned on a structure representation s and a content representation c: the structure representation s is inferred from the input video, and the content representation c is modified based on a text prompt describing the desired edit, as shown in the figure below.
During training (left side of the figure above), the input video x is encoded with a fixed encoder E and diffused to z_t. Meanwhile, a structure representation s is extracted by encoding depth maps obtained with MiDaS, and a content representation c by encoding one of the frames with CLIP. The model then learns to reverse the diffusion process in latent space, with the help of s, z_t, and c, the latter provided through cross-attention blocks.
During inference (right side of the figure above), the structure s of the input video is provided in the same way. To specify content via text, CLIP text embeddings are converted to image embeddings via a prior.
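Putting the pieces together, a conceptual sketch of this inference flow might look as follows; every module here (`midas`, `structure_encoder`, `prior`, `denoise_step`, `decode`) is a stand-in used purely for illustration, not Gen-1's actual API:

```python
import torch

@torch.no_grad()
def edit_video(frames, prompt_text_embedding, midas, structure_encoder,
               prior, denoise_step, decode, num_steps: int = 50):
    """Gen-1-style inference flow (schematic). frames: (T, 3, H, W) input video."""
    depth_maps = midas(frames)                 # per-frame monocular depth
    s = structure_encoder(depth_maps)          # structure representation s
    c = prior(prompt_text_embedding)           # CLIP text -> image embedding
    z = torch.randn(frames.shape[0], 4,
                    frames.shape[-2] // 8, frames.shape[-1] // 8)
    for t in reversed(range(num_steps)):
        # Reverse diffusion in latent space, conditioned on structure and content.
        z = denoise_step(z, t, structure=s, content=c)
    return decode(z)                           # back to pixel space
```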
The image architecture is extended by introducing temporal layers, which are only active for video inputs; all other layers are shared between the image and video model. The autoencoder remains fixed and processes each frame of a video independently.
The UNet consists mainly of two kinds of blocks — residual blocks and transformer blocks — and both are extended to video by adding 1D convolutions across time and 1D self-attention across time.
In each residual block (left side of the figure above), a temporal convolution is introduced after each 2D convolution. Likewise (right side of the figure), each 2D transformer block is followed by a temporal 1D transformer block, which mimics its spatial counterpart along the time axis; learnable positional encodings of the frame index are fed into the temporal transformer blocks.
In other words:
→ a 1D temporal convolution is added after each spatial convolution (spatial convolution vs. temporal convolution)
→ a 1D temporal attention layer is added after each spatial attention layer (spatial attention vs. temporal attention)
In the final implementation, an image is treated as a video with a single frame so that both cases are handled uniformly. A batched tensor with batch size b, number of frames n, channels c, and spatial resolution h × w — i.e., of shape b × n × c × h × w — is rearranged to (b · n) × c × h × w for spatial layers, to (b · h · w) × c × n for temporal convolutions, and to (b · h · w) × n × c for temporal self-attention.
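A minimal sketch of these rearrangements using einops (my own illustration of the reshaping described above, not code from the paper):

```python
import torch
from einops import rearrange

b, n, c, h, w = 2, 8, 64, 32, 32
x = torch.randn(b, n, c, h, w)          # batched video latents

# Spatial layers see every frame as an independent image.
x_spatial = rearrange(x, 'b n c h w -> (b n) c h w')

# Temporal 1D convolutions convolve along the frame axis at each pixel location.
x_tconv = rearrange(x, 'b n c h w -> (b h w) c n')

# Temporal self-attention attends across frames, with channels as features.
x_tattn = rearrange(x, 'b n c h w -> (b h w) n c')
```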
// To be updated
1.1.2.3 Representing Content and Structure
In short, the goal is to edit an input video according to a user-provided text prompt describing the desired edited video, but there is a problem: there is no training data of triplets consisting of a video, its edit prompt, and the resulting output — nor even pairs of videos and text captions.
Unlike many of today's generative AI models, Emu Edit can follow instructions precisely, ensuring that pixels of the input image unrelated to the instruction remain unchanged. For example, in the figure below on the left, the user asks to remove the puppy from the grass, and the edited image is almost indistinguishable from the original apart from that change; on the right, Emu Edit also handles removing the text in the bottom-left corner and then swapping in a new background.
2.1.2 Building a 10-Million-Sample Dataset Covering 16 Distinct Tasks
Given that the scale, diversity, and quality of existing datasets are limited, Meta built a dataset of 16 distinct tasks and 10 million synthetic samples to train the model. Each example (cI, cT, x, i) in the dataset contains an input image cI, a text instruction cT describing the task to perform, a target output image x, and a task index i (out of the sixteen). Concretely:
Task list — the 16 tasks fall into three main categories: region-based editing, free-form editing, and vision tasks.
Region-Based Editing
Local: substituting one object for another, or altering an object's attributes (e.g., "make it smile")
Remove: erasing an object from the image
Add: inserting a new object into the image
Texture: altering an object's visual characteristics without affecting its structure (e.g., painting over, filling, or covering an object)
Background: changing the scene's background
Free-Form Editing
Global: edit instructions that affect the entire image, or that cannot be described using a mask (e.g., "let's see it in the summer")
Style: changing the style of an image
Text Editing: text-related editing tasks such as adding, removing, or swapping text, and altering the text's font and color
Vision Tasks
Detect: identifying and marking a specific object within the image with a rectangular bounding box
Segment: isolating and marking an object in the image
Color: color adjustments such as sharpening and blurring
Image-to-Image Translation: tasks involving bidirectional image-type conversion, such as sketch-to-image, depth-map-to-image, normal-map-to-image, pose-to-image, segmentation-map-to-image, and so on
Image Pairs Generation — a crucial prerequisite when creating a pair of input and edited images is to guarantee that the two images differ only in specific elements or locations, while remaining identical in all other aspects. Previous instruction-based image editing methods (e.g., InstructPix2Pix: Learning to Follow Image Editing Instructions) rely on Prompt-to-Prompt (P2P) to build an image-editing dataset; P2P injects cross-attention maps from the input-image generation into the edited-image generation.
To support local edits, P2P additionally approximates a mask of the edited region based on the cross-attention maps and constrains the edit to that local area. P2P relies on word-to-word alignment between the input image caption and the edited image caption (e.g., "a cat riding a bicycle" vs. "a cat riding a car") to produce editing image pairs. However, when there is no word-to-word alignment, the resulting mask tends to be imprecise because of its reliance on the cross-attention maps.
Furthermore, since word-to-word alignment is not a practical assumption in most image editing tasks, this approach often fails to preserve structure and identity. To address this challenge, Emu Edit proposes a mask extraction method applied before the editing process. The approach involves: (i) identifying the edited areas from the editing instruction via an LLM and creating corresponding masks before image generation, and (ii) integrating these masks during the editing process to ensure a seamless fusion of the edited regions with the original image.
In addition, various techniques, including dilation and Gaussian blurring, are used to refine the masks.
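A minimal sketch of this kind of mask refinement (dilation followed by Gaussian blurring) using OpenCV; the kernel sizes and iteration counts are illustrative assumptions, not values from the paper:

```python
import cv2
import numpy as np

def refine_mask(mask: np.ndarray, dilate_iters: int = 2, blur_ksize: int = 21) -> np.ndarray:
    """Expand a binary edit mask slightly and soften its boundary.

    mask: uint8 array in {0, 255}, same H x W as the image.
    """
    kernel = np.ones((5, 5), np.uint8)
    # Dilation grows the mask so the edit fully covers the target region.
    dilated = cv2.dilate(mask, kernel, iterations=dilate_iters)
    # Gaussian blurring feathers the edges for a seamless blend with the original.
    return cv2.GaussianBlur(dilated, (blur_ksize, blur_ksize), 0)
```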
A comprehensive filtering approach is also employed to ensure the fidelity of the dataset. This includes: (i) using the task predictor (Sec. 4.2 of the paper) to reassign samples whose instructions should belong to another task; (ii) applying CLIP filtering metrics [2]; (iii) employing structure-preserving filtering based on the L1 distance between the depth map of the input image and that of the edited image; and (iv) applying image detectors to validate the presence (Add task), absence (Remove task), or replacement (Local task) of elements, according to the objects specified in the instruction.
This process filters out 70% of the data, yielding a final dataset of 10 million samples.
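As a rough sketch of the structure-preserving filter in item (iii) above — keeping a pair only if the depth maps of the input and edited images stay close in L1 distance — here is an illustrative implementation; `depth_model` and the threshold are assumptions for demonstration:

```python
import torch

def passes_structure_filter(input_img: torch.Tensor, edited_img: torch.Tensor,
                            depth_model, threshold: float = 0.1) -> bool:
    """Keep an (input, edited) pair only if the edit preserved scene structure."""
    with torch.no_grad():
        d_in = depth_model(input_img.unsqueeze(0))
        d_out = depth_model(edited_img.unsqueeze(0))
    # Normalize so the L1 distance is comparable across images.
    d_in = (d_in - d_in.min()) / (d_in.max() - d_in.min() + 1e-8)
    d_out = (d_out - d_out.min()) / (d_out.max() - d_out.min() + 1e-8)
    return (d_in - d_out).abs().mean().item() < threshold
```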
2.1.3 Model Architecture: Pre-train a Latent Diffusion Model, Then Fine-tune on a Few Thousand Annotated Images
As with the classic recipe — pre-train first, then fine-tune — the Emu model is a two-stage approach that begins with a pre-training phase and concludes with a quality fine-tuning stage. The key point is that the fine-tuning dataset can be relatively small, only a few thousand images, but must be of exceptional quality, typically requiring human annotation.
Emu adopts the latent diffusion model architecture (High-Resolution Image Synthesis with Latent Diffusion Models), supports high-resolution image generation, and incorporates a 16-channel autoencoder with encoder E and decoder D.
A large U-Net ϵθ with 2.8 billion parameters θ, text embeddings from CLIP ViT-L [18] and T5-XXL [19], and a substantial pre-training dataset of 1.1 billion images give the model the ability to learn complex semantics and finer details, with a noise-offset strategy contributing to high-contrast, aesthetically pleasing image generation.
Given the encoded latent of an image z = E(x), the diffusion process generates a noisy latent z_t where the noise level increases over timesteps t ∈ T. To convert Emu into an instruction-based image editing model, it is conditioned on the image to be modified, cI, and on the instruction, cT.
Emu Edit minimizes the following objective:
min_θ E_{y, ϵ, t} [ ‖ϵ − ϵθ(z_t, t, E(cI), cT)‖²₂ ]
where ϵ ∈ N(0, 1) is the noise added by the diffusion process and y = (cT, cI, x) is a triplet of instruction, input image, and target image from the dataset. In practice, the weights of Emu Edit are initialized with the weights of Emu.
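A minimal sketch of this instruction-conditioned denoising objective in PyTorch; `unet`, `vae_encode`, and `scheduler` are stand-ins, not the actual Emu Edit code:

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae_encode, scheduler, x, c_I, c_T):
    """One step of the instruction-conditioned denoising objective (schematic)."""
    z = vae_encode(x)                          # latent of the target image
    cond_image = vae_encode(c_I)               # latent of the image to be edited
    t = torch.randint(0, 1000, (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)
    z_t = scheduler.add_noise(z, noise, t)     # forward diffusion to noise level t
    # The U-Net sees the noisy latent together with the input-image latent and
    # the text instruction, and must predict the added noise.
    noise_pred = unet(torch.cat([z_t, cond_image], dim=1), t, encoder_hidden_states=c_T)
    return F.mse_loss(noise_pred, noise)
```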
Second, to handle this wide array of tasks effectively, the concept of learned task embeddings is introduced, used to steer the generation process toward the correct generative task.
Specifically, for each task a unique task embedding vector is learned and integrated into the model through cross-attention interactions, as well as by adding it to the timestep embeddings.
During training, given a sample from the dataset, the task index i is used to fetch the task's embedding vector vi from an embedding table, and vi is optimized jointly with the model weights.
Concretely, the task embedding is introduced as an additional condition to the U-Net ϵθ: it is integrated into the U-Net via cross-attention interactions and by adding it to the timestep embeddings.
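An illustrative sketch of this two-way injection, assuming the task embedding is appended to the text tokens so cross-attention can attend to it (one plausible reading of "cross-attention interactions", not a confirmed implementation detail):

```python
import torch
import torch.nn as nn

class TaskConditioning(nn.Module):
    """One learned embedding per task, injected into two places."""
    def __init__(self, num_tasks: int = 16, dim: int = 1024):
        super().__init__()
        self.task_table = nn.Embedding(num_tasks, dim)

    def forward(self, task_index: torch.Tensor, timestep_emb: torch.Tensor,
                text_tokens: torch.Tensor):
        v_i = self.task_table(task_index)                    # (B, dim)
        # 1) Added to the timestep embedding.
        timestep_emb = timestep_emb + v_i
        # 2) Appended to the text tokens for the cross-attention layers.
        text_tokens = torch.cat([text_tokens, v_i.unsqueeze(1)], dim=1)
        return timestep_emb, text_tokens
```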
Which text-to-image model is used for initialization? The text-to-image U-Net architecture is reused, and all spatial parameters are initialized from the pre-trained T2I model. The model uses both a frozen T5-XL [15] and a frozen CLIP [58] text encoder to extract features from the text prompt, with separate cross-attention layers in the U-Net attending to each set of text features. After initialization, the model contains 2.7B spatial parameters, which are kept frozen, and 1.7B temporal parameters, which are learned.
Because video-text datasets are much smaller than image-text datasets, the researchers also initialize their text-to-video model from a pre-trained text-to-image (T2I) model whose weights are frozen. They identify key design decisions — adjusted noise schedules for diffusion and multi-stage training — that enable the direct generation of 512px high-resolution videos, without requiring the deep cascade of models used in prior work.
A few more details:
F is initialized with a pre-trained text-to-image model, which ensures it can already generate images at initialization. Since the spatial layers are initialized from a pre-trained T2I model and kept frozen, the model retains the conceptual and stylistic diversity learned from large image-text datasets and uses it to generate the image I. This comes at no additional training cost, unlike approaches such as Imagen Video that jointly fine-tune on image and video data to maintain such style.
Of course, many direct T2V approaches (e.g., Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models, and Make-A-Video: Text-to-Video Generation without Text-Video Data) also initialize from a pre-trained T2I model and keep the spatial layers frozen. However, they do not employ this image-based factorization and therefore do not retain the quality and diversity of the T2I model.
Since a latent diffusion model is used, the video V is first converted into a latent space X ∈ R^{T×C×H×W} using an image autoencoder applied frame by frame, which reduces the spatial dimensions; the latent space can be converted back to pixel space using the autoencoder's decoder. The T frames of the video are noised independently to produce the noised input X_t, which the diffusion model is trained to denoise.
The pre-trained T2I model is already text-conditioned; combined with the image conditioning described above, F is conditioned on both text and image.
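A minimal sketch of this factorized inference — first an image, then a video conditioned on it; `t2i_model` and `video_model` are stand-ins for the pre-trained T2I model and the image-and-text-conditioned video model F, mirroring the factorization described above rather than any released API:

```python
import torch

@torch.no_grad()
def generate_video(prompt: str, t2i_model, video_model, num_frames: int = 16):
    """Factorized text-to-video: explicitly generate an image, then the clip."""
    # Step 1: generate the first image I from the text prompt.
    image = t2i_model(prompt)
    # Step 2: generate the full clip conditioned on both the prompt and I.
    return video_model(prompt=prompt, image=image, num_frames=num_frames)
```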
The benefits of this approach are as follows:
Unlike methods that generate video directly from text, this factorized approach explicitly generates an image at inference time, which makes it easy to preserve the visual diversity, style, and quality of the text-to-image model, as shown in the figure below. This lets EMU VIDEO outperform direct T2V methods even with the same training data, compute, and number of trainable parameters.
Moreover, with the multi-stage training approach, the quality of text-to-video generation can be improved substantially.
2.2.2 Extending the Duration of Generated Videos
As the released demos show, EMU VIDEO already supports 4-second video generation. The paper also explores ways to extend the video duration.
The authors note that with a small architectural modification, they can condition the model on T frames and extend the video. They therefore train a variant of EMU VIDEO that generates the next 16 frames conditioned on the "past" 16 frames. When extending a video, they use a future text prompt different from the original one; the results are shown in Figure 7 of the paper. They find that the extended video follows both the original video and the future text prompt.
PixelDance, in turn, is a video generation method based on latent diffusion models, conditioned on <text, first frame, last frame> instructions.
The text instruction is encoded by a pre-trained text encoder and integrated into the diffusion model via cross-attention.
The image instructions are encoded with a pre-trained VAE encoder and concatenated with either perturbed video latents or Gaussian noise as the input to the diffusion model.
For the first-frame instruction, the ground-truth first frame is used during training, which makes the model adhere to it strictly at inference and preserves continuity between consecutive video clips. At inference time, this instruction can conveniently be obtained from a T2I model [32] or provided directly by the user.
But how is the last frame obtained? Since the last frame is handled differently from the first, three techniques were developed:
First, during training, the last-frame instruction is randomly selected from the last three (ground-truth) frames of a video clip.
Second, noise is introduced into the instruction to mitigate the model's reliance on it and improve robustness; that is, the encoded latents c_image of the image instructions are perturbed with noise.
Third, the last-frame instruction is randomly dropped with a certain probability (e.g., 25%) during training. Correspondingly, a simple yet effective inference sampling strategy is proposed:
During the first τ denoising steps, the last-frame instruction is used to guide video generation toward the desired ending state.
Then, during the remaining steps, the instruction is dropped, allowing the model to generate a more temporally coherent video (up to three minutes long). The influence of the last-frame instruction can be adjusted via τ.
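A minimal sketch of this τ-based sampling strategy; every name here (`unet`, `scheduler`, the two precomputed image conditions) is a stand-in for illustration, not PixelDance's actual API:

```python
import torch

@torch.no_grad()
def sample_with_tau(unet, scheduler, z, c_text, c_image_full, c_image_no_last, tau: int):
    """Drop the last-frame condition after the first tau denoising steps.

    z: (F, C, H, W) noised video latents. c_image_full contains first- and
    last-frame latents, c_image_no_last only the first frame.
    """
    for i, t in enumerate(scheduler.timesteps):
        # Guide toward the desired ending first, then let the model roam free.
        c_image = c_image_full if i < tau else c_image_no_last
        model_input = torch.cat([z, c_image], dim=1)      # concat along channels
        noise_pred = unet(model_input, t, encoder_hidden_states=c_text)
        z = scheduler.step(noise_pred, t, z).prev_sample
    return z
```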
For long-video generation, autoregressive methods [15, 22, 41] employ a sliding window to generate a new clip conditioned on the previous clip.
Ultimately, to generate long videos, PixelDance is trained to strictly follow the first-frame instruction, and the last frame of the preceding clip is used as the first-frame instruction for generating the subsequent clip.
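A short sketch of this rollout; `generate_clip(prompt, first_frame)` is a stand-in for one PixelDance sampling call returning a tensor of frames:

```python
import torch

@torch.no_grad()
def generate_long_video(generate_clip, text_prompts, first_frame, num_clips: int):
    """Chain clips by reusing each clip's last frame as the next first-frame instruction."""
    clips = []
    current_first = first_frame
    for prompt in text_prompts[:num_clips]:
        clip = generate_clip(prompt, current_first)   # (T, C, H, W)
        clips.append(clip)
        current_first = clip[-1]                      # last frame seeds the next clip
    return torch.cat(clips, dim=0)
```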
3.2.2 PixelDance Architecture: a 2D UNet Extended with Temporal Layers and Text Instructions, Plus Image Instruction Injection
The widely used 2D UNet is adopted as the diffusion model; it is constructed from a series of spatial downsampling layers followed by a series of spatial upsampling layers with inserted skip connections.
Specifically, it is built from two basic blocks, a 2D convolution block and a 2D attention block. The 2D UNet is extended to a 3D variant by inserting temporal layers [22], with a 1D convolution layer along the temporal dimension after each 2D convolution layer, and a 1D attention layer along the temporal dimension after each 2D attention layer (the same recipe as Gen-1, described in Section 1.1.2.2 "Spatio-temporal Latent Diffusion" above).
The model can be trained jointly on images and videos to maintain high-fidelity generation in the spatial dimensions; for image inputs, the 1D temporal operations are disabled. Bi-directional self-attention is used in all temporal attention layers. The text instruction is encoded with a pre-trained CLIP text encoder [30], and the embedding c_text is injected through cross-attention layers in the UNet, with the hidden states as queries and c_text as keys and values.
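An illustrative sketch of how a temporal layer can be gated off for image inputs (images passed as single-frame videos); this is my own minimal module, not PixelDance's code:

```python
import torch
import torch.nn as nn
from einops import rearrange

class GatedTemporalConv(nn.Module):
    """Temporal 1D convolution that becomes a no-op for single-frame (image) input."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, n, c, h, w); images are represented as videos with n == 1.
        b, n, c, h, w = x.shape
        if n == 1:
            return x                       # temporal op disabled for image input
        y = rearrange(x, 'b n c h w -> (b h w) c n')
        y = self.conv(y)
        return rearrange(y, '(b h w) c n -> b n c h w', b=b, h=h, w=w)
```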
Image Instruction Injection — image instructions for both the first and last frames are incorporated in conjunction with the text instruction, using ground-truth video frames as the instructions during training. Given the image instructions for the first and last frames, denoted {I_first, I_last}, they are first encoded into the input space of the diffusion model using the VAE, resulting in {f_first, f_last}, where f ∈ R^{C×H×W}.
To inject the instructions without losing temporal position information, the final image condition is then constructed as:
c_image = [f_first, PADs, f_last] ∈ R^{F×C×H×W}
where PADs ∈ R^{(F−2)×C×H×W}. The condition c_image and the noised latent z_t are concatenated along the channel dimension to form the input to the diffusion model.
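A minimal sketch of assembling this condition; the assumption that the PAD frames are zeros is mine for illustration, as is every function name:

```python
import torch

def build_image_condition(f_first: torch.Tensor, f_last: torch.Tensor,
                          num_frames: int, z_t: torch.Tensor) -> torch.Tensor:
    """Assemble c_image = [f_first, PADs, f_last] and concatenate it with z_t.

    f_first, f_last: (C, H, W) VAE latents of the first/last-frame instructions.
    z_t:             (F, C, H, W) noised video latents.
    """
    c, h, w = f_first.shape
    pads = torch.zeros(num_frames - 2, c, h, w,
                       device=f_first.device, dtype=f_first.dtype)  # placeholder frames
    c_image = torch.cat([f_first.unsqueeze(0), pads, f_last.unsqueeze(0)], dim=0)
    # Concatenate condition and noised latents along the channel dimension.
    return torch.cat([z_t, c_image], dim=1)
```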
For this reason, the training data is extended with 500K additional self-collected, watermark-free video clips depicting real-world entities such as humans, animals, objects, and landscapes, paired with coarse-grained text descriptions. Although this dataset makes up only a modest proportion, the authors find that combining it with WebVid-10M for training ensures that PixelDance can generate watermark-free videos, provided the image instructions are free of watermarks.
PixelDance is trained jointly on a video-text dataset and an image-text dataset. Specifically:
For video data, 16 consecutive frames are randomly sampled from each video at 4 fps. Following prior work (Imagen Video: High Definition Video Generation with Diffusion Models), LAION-400M is used as the image-text dataset, and image-text data are used every 8 training iterations.
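A short sketch of this alternating schedule; `train_step` and the data loaders are stand-ins, and treating the 1-in-8 cadence as "every eighth iteration is an image batch" is my reading of the sentence above:

```python
from itertools import cycle

def joint_training(video_loader, image_loader, train_step, num_iters: int,
                   image_every: int = 8):
    """Alternate video-text and image-text batches during training."""
    video_iter, image_iter = cycle(video_loader), cycle(image_loader)
    for step in range(num_iters):
        # One image-text batch every `image_every` iterations, otherwise video-text.
        batch = next(image_iter) if step % image_every == 0 else next(video_iter)
        train_step(batch)                 # stand-in for one optimization step
```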
Text-to-image pre-training (image pretraining), i.e., a 2D text-to-image diffusion model such as SDXL (Improving Latent Diffusion Models for High-Resolution Image Synthesis).
Video pre-training on a large but low-resolution video dataset. An initial dataset of long videos is collected, forming the base data for the video pre-training stage. To avoid cuts and fades leaking into the synthesized videos, a cut-detection pipeline is applied in a cascaded manner at three different FPS levels. Figure 2 (left) of the paper provides evidence for the need for cut detection: after applying the cut-detection pipeline, a significantly higher number (~4×) of clips is obtained, indicating that many video clips in the unprocessed dataset contain cuts beyond those recorded in the metadata.
Next, each clip is annotated with three different synthetic captioning methods: first, the image captioner CoCa [103] is used to annotate the mid-frame of each clip, and V-BLIP [104] is used to obtain a video-based caption; finally, a third description of the clip is generated via an LLM-based summarization of the first two captions.
High-resolution video fine-tuning on a much smaller dataset of higher-quality videos. Here the authors draw on training techniques from latent image diffusion modeling [12, 60] and increase the resolution of the training examples; moreover, they use a small fine-tuning dataset comprising 250K pre-captioned video clips of high visual fidelity.
In short, SVD is based on Stable Diffusion 2.1: the proposed curation scheme is first applied to a large video dataset comprising roughly 600 million samples, on which a strong pre-trained text-to-video base model is trained.
The base model is then fine-tuned on a smaller, high-quality dataset for high-resolution downstream tasks.
At the start of each iteration k, a window of 3D shapes (shown in green) of dimension D is initialized and dispatched to p GPUs, which compute the SDS/VSD gradients of the shapes in parallel.
These gradients are then gathered for a rollout using the rule below (Eq. (9) of the paper), and the shapes are updated accordingly:
θ_k^τ = h†(s(θ_{k−1}^{τ−1}), h†(s(θ_{k−1}^{τ−2}), … h†(s(θ_{k−1}^0), θ_{k−1}^0) … ))
The resulting shapes for iteration k + 1 (shown in orange) are compared with those from iteration k, and the window is slid forward until the error at that time step is no longer smaller than the threshold e, which is adaptively updated using the mean/median error of the window.
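As a rough, purely illustrative sketch of this window-sliding check (my own reading of the scheme above, using the mean error as the adaptive threshold; not the paper's implementation):

```python
import torch

def slide_window(prev_shapes, new_shapes, threshold: float):
    """Advance the window up to the first position where the two iterations still disagree.

    prev_shapes / new_shapes: lists of parameter tensors from iterations k and k+1.
    Returns how many leading positions can be accepted plus an updated threshold.
    """
    errors = [torch.norm(a - b).item() for a, b in zip(prev_shapes, new_shapes)]
    advance = 0
    for err in errors:
        if err >= threshold:          # stop at the first error not below the threshold
            break
        advance += 1
    new_threshold = sum(errors) / len(errors)   # adaptive update from the window's mean error
    return advance, new_threshold
```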
Optionally, in the case of VSD, independent copies of the LoRA diffusion model are kept on all GPUs and updated independently, without extra communication.