I'm a beginner here; these are notes I'm keeping as I learn, in the hope that they also help others who are just getting started. Corrections from more experienced readers are very welcome. Any infringing content will be removed immediately.
✨ For explanations of the ColumnParallelLinear and RowParallelLinear layers mentioned below, see my earlier posts.
An MLP ("multilayer perceptron") is also known as a feed-forward neural network or artificial neural network (ANN).
[Figure: schematic of a multilayer perceptron]
Let f be the activation function; then the MLP computes

$\operatorname{MLP}(X)=f\left(XW^{T}+B_{1}\right)V^{T}+B_{2}$

where W and B₁ are the weight and bias of the first (h -> 4h) projection, V and B₂ those of the second (4h -> h) projection, and dropout is applied to the result.
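To make the formula concrete, here is a minimal single-device sketch in plain PyTorch (no tensor parallelism; the sizes and variable names are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

b, s, h = 2, 16, 64          # batch size, sequence length, hidden size (illustrative values)
x = torch.randn(b, s, h)

W = torch.randn(4 * h, h)    # first projection weight, shape (4h, h)
B1 = torch.zeros(4 * h)      # first projection bias
V = torch.randn(h, 4 * h)    # second projection weight, shape (h, 4h)
B2 = torch.zeros(h)          # second projection bias

y = F.gelu(x @ W.T + B1)     # f(X W^T + B1), shape (b, s, 4h)
z = y @ V.T + B2             # second projection, shape (b, s, h)
z = F.dropout(z, p=0.1)      # dropout at the end
print(y.shape, z.shape)      # torch.Size([2, 16, 256]) torch.Size([2, 16, 64])
```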
Input shape: (b, s, h);
Notation:
b: batch_size;
s: sequence length;
h: hidden_size;
hp: hidden_size / number of partitions, i.e. the slice held by each model-parallel rank (so 4hp = 4h / p for p partitions).
Shape of the weight matrix W: (4hp, h);
so X·W^T has shape: (b, s, 4hp);
shape of the bias B: (1, 4hp) (the same bias row is added at every position: it broadcasts over the batch and sequence dimensions, as if there were b copies of an (s, 4hp) bias whose rows are all identical);
so the output of this first projection has shape (b, s, 4hp). A small shape-check sketch follows below.
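To see what the "column parallel" split does to these shapes, the following snippet simulates it on a single device with ordinary tensors (my own sketch, not the repo's code; h, p and the values are arbitrary):

```python
import torch

b, s, h, p = 2, 16, 64, 2            # p = number of model-parallel partitions (illustrative)
x = torch.randn(b, s, h)
W = torch.randn(4 * h, h)            # the full (4h, h) weight, conceptually

# ColumnParallelLinear splits W along its output dimension: each of the p ranks
# holds a (4h/p, h) = (4hp, h) slice and computes only its slice of the output.
W_shards = torch.chunk(W, p, dim=0)
outputs = [x @ Wi.T for Wi in W_shards]          # each (b, s, 4hp)
print(W_shards[0].shape, outputs[0].shape)       # (128, 64) and (2, 16, 128)
# Each rank would also add its own (1, 4hp) slice of the bias here.

# Concatenating the per-rank outputs along the last dim recovers the full result.
full = torch.cat(outputs, dim=-1)
print(torch.allclose(full, x @ W.T, atol=1e-5))  # True
```

Because `gather_output=False` is used in the code below, this concatenation is not actually performed: each rank keeps its (b, s, 4hp) slice and feeds it straight into GELU and the row-parallel layer.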
The result is then passed through the GELU activation.

GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution (mean 0, variance 1). The approximate expression (the one used in the code) is:

$\operatorname{GELU}(x)=0.5x\left(1+\tanh\left[\sqrt{2/\pi}\left(x+0.044715x^{3}\right)\right]\right)$
Why this helps: GELU can be read as weighting the input x by Φ(x), the probability of keeping it, so the (soft) dropping depends on the input itself; the smaller x is, the higher the probability that it is effectively dropped (pushed toward zero).
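A quick numerical check of the approximation (my own snippet), comparing it against the exact erf-based definition GELU(x) = x·Φ(x):

```python
import torch

x = torch.linspace(-5, 5, steps=11)

# Exact definition: GELU(x) = x * Phi(x), Phi being the standard normal CDF.
exact = x * 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))

# Tanh approximation used in the code (sqrt(2/pi) is about 0.7978845608).
approx = 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x ** 3)))

print((exact - approx).abs().max())  # very small; the two curves almost coincide
```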
Shape of the weight matrix V: (h, 4hp);
so multiplying the GELU output (b, s, 4hp) by V^T gives shape: (b, s, h);
shape of the bias B: (1, h) (again the same bias row is added at every position, broadcasting over the batch and sequence dimensions, as if there were b copies of an (s, h) bias whose rows are all identical);
so the final shape is (b, s, h).
Finally, a dropout layer is applied, and the MLP is complete.
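The row-parallel second projection can be simulated the same way on a single device (again my own sketch, not the repo's code): each partition holds an (h, 4hp) slice of V together with its (b, s, 4hp) slice of the GELU output, and the per-partition partial results are summed (an all-reduce in the real RowParallelLinear) to recover (b, s, h):

```python
import torch

b, s, h, p = 2, 16, 64, 2
y = torch.randn(b, s, 4 * h)          # GELU output, full shape (b, s, 4h)
V = torch.randn(h, 4 * h)             # the full (h, 4h) weight, conceptually
B = torch.zeros(h)

# RowParallelLinear splits V along its input dimension: each rank holds an
# (h, 4hp) slice and multiplies it with its local (b, s, 4hp) slice of y.
y_shards = torch.chunk(y, p, dim=-1)
V_shards = torch.chunk(V, p, dim=1)
partials = [yi @ Vi.T for yi, Vi in zip(y_shards, V_shards)]   # each (b, s, h)

# Summing the partials (an all-reduce in the parallel version) gives the output.
out = sum(partials) + B
print(out.shape)                                      # torch.Size([2, 16, 64])
print(torch.allclose(out, y @ V.T + B, atol=1e-4))    # True
```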
Code location: model/mpu/sparse_transformer.py
class GPT2ParallelMLP(torch.nn.Module):
    """MLP for GPT2.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform gelu transformation, and project the
    state back into h hidden dimension. At the end, dropout is also
    applied.

    Arguments:
        hidden_size: The hidden size of the self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layer initialization. If None,
                                  use `init_method`.
    """
    def __init__(self, hidden_size, output_dropout_prob, init_method,
                 output_layer_init_method=None):
        super(GPT2ParallelMLP, self).__init__()
        # As documented above: fall back to init_method when no separate
        # output-layer init method is given.
        if output_layer_init_method is None:
            output_layer_init_method = init_method
✨h->4h
        # Project to 4h: linear transformation from h to 4h.
        self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4 * hidden_size,
                                                  gather_output=False,
                                                  init_method=init_method)
✨4h->h
        # Project back to h.
        self.dense_4h_to_h = RowParallelLinear(
            4 * hidden_size,
            hidden_size,
            input_is_parallel=True,
            init_method=output_layer_init_method)
        self.dropout = torch.nn.Dropout(output_dropout_prob)
    def forward(self, hidden_states):
        # [b, s, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = gelu(intermediate_parallel)
        # [b, s, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        output = self.dropout(output)
        return output
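For reference, a hedged sketch of how this module might be instantiated. It assumes the model-parallel state in mpu has already been set up (the torch.distributed / model-parallel initialization boilerplate is omitted), and the init helper below is only an illustrative choice, not necessarily what the repo passes at this call site:

```python
import torch

def init_method_normal(std=0.02):
    """Weight initializer in the style used by the repo (illustrative)."""
    def init_(tensor):
        return torch.nn.init.normal_(tensor, mean=0.0, std=std)
    return init_

# Assumes mpu's model parallelism has already been initialized.
mlp = GPT2ParallelMLP(hidden_size=1024,
                      output_dropout_prob=0.1,
                      init_method=init_method_normal(0.02))

hidden_states = torch.randn(2, 16, 1024)   # [b, s, h]
out = mlp(hidden_states)                   # [b, s, h]
```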
Addendum: the gelu function used here:
@torch.jit.script
def gelu_impl(x):
    """OpenAI's gelu implementation."""
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * x *
                                       (1.0 + 0.044715 * x * x)))


def gelu(x):
    return gelu_impl(x)
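Side note: the constant 0.7978845608028654 is just √(2/π), so this is exactly the tanh approximation given earlier. On PyTorch versions that support the `approximate` argument of F.gelu (1.12+), it should line up with the built-in implementation; a quick check (my own snippet):

```python
import math
import torch

print(math.sqrt(2.0 / math.pi))   # 0.7978845608028654

x = torch.randn(1000)
ours = 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * x * (1.0 + 0.044715 * x * x)))
ref = torch.nn.functional.gelu(x, approximate='tanh')
print(torch.allclose(ours, ref, atol=1e-5))   # True, up to floating-point noise
```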
Feel free to point out any mistakes in the comments. Thanks!