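To understand matrix derivatives, we can lean on the derivatives we know from high school. Back then we always differentiated with respect to scalars, and a scalar can be viewed as a special 1×1 matrix. This post is mainly a record of how backpropagation works in machine learning, so it does not analyze matrix calculus in depth (truthfully, I couldn't anyway; I only know the simple cases).
Here is the one matrix-derivative case that the backpropagation steps below rely on: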
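$$
\frac{\partial (a^T x)}{\partial x}
= \frac{\partial (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)}{\partial x}
= \begin{bmatrix}
\frac{\partial (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)}{\partial x_1} \\
\frac{\partial (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)}{\partial x_2} \\
\vdots \\
\frac{\partial (a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)}{\partial x_n}
\end{bmatrix}
= \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}
= a
$$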
Once this makes sense, we are ready to get started~
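
The identity $\frac{\partial (a^T x)}{\partial x} = a$ can be sanity-checked numerically. The snippet below is only a sketch: the helper `numerical_gradient` and the random vectors `a` and `x` are made up for illustration, and the gradient is estimated with central differences rather than derived symbolically.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
a = rng.standard_normal(4)  # arbitrary example values (assumption)
x = rng.standard_normal(4)

# f(x) = a^T x, so each partial derivative should equal the matching a_i.
grad = numerical_gradient(lambda v: a @ v, x)
matches = np.allclose(grad, a, atol=1e-6)
print(matches)
```

Because $a^T x$ is linear in $x$, the estimate agrees with `a` up to floating-point rounding, whatever `x` is.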

Let's start propagating backward:
Hidden layer 2:
Activation: $da^{[2]}=\frac{\partial L}{\partial a^{[2]}}$
$$dz^{[2]}=\frac{\partial L}{\partial z^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\cdot\frac{\partial a^{[2]}}{\partial z^{[2]}}=da^{[2]}\cdot g^{[2]\prime}(z^{[2]})$$
$$dW^{[2]}=\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\cdot\frac{\partial z^{[2]}}{\partial W^{[2]}}=dz^{[2]}\cdot a^{[1]T} \quad\Rightarrow\quad W^{[2]} \mathrel{-}= \alpha\cdot dW^{[2]}$$
$$db^{[2]}=\frac{\partial L}{\partial b^{[2]}}=\frac{\partial L}{\partial z^{[2]}}\cdot\frac{\partial z^{[2]}}{\partial b^{[2]}}=dz^{[2]} \quad\Rightarrow\quad b^{[2]} \mathrel{-}= \alpha\cdot db^{[2]}$$
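
To make the shapes concrete, here is a minimal NumPy sketch of the layer-2 step for a single training example. Everything not in the formulas is an assumption for illustration: the forward step $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, a sigmoid activation, the loss $L = \frac{1}{2}\lVert a^{[2]} - y\rVert^2$, the layer sizes, and all variable names. The product in $dz^{[2]}$ is taken elementwise, which is what an elementwise activation implies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n1, n2 = 3, 2                        # hypothetical layer sizes
a1 = rng.standard_normal((n1, 1))    # activation arriving from layer 1
W2 = rng.standard_normal((n2, n1))
b2 = np.zeros((n2, 1))
y = np.array([[1.0], [0.0]])         # hypothetical target
alpha = 0.1                          # learning rate

# Forward step for layer 2 (assumed form).
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Backward step, following the formulas above.
da2 = a2 - y                         # dL/da2 for the assumed squared-error loss
dz2 = da2 * a2 * (1.0 - a2)          # elementwise da2 * g'(z2); sigmoid' = a(1 - a)
dW2 = dz2 @ a1.T                     # (n2, 1) @ (1, n1) -> (n2, n1), same shape as W2
db2 = dz2                            # same shape as b2

# Gradient-descent update, written out of place so W2 and b2 stay inspectable.
W2_new = W2 - alpha * dW2
b2_new = b2 - alpha * db2
```

Note that `dW2` has the same shape as `W2`, which is why the outer product is `dz2 @ a1.T` and not the other way around.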
Hidden layer 1:
Activation: $da^{[1]}=\frac{\partial L}{\partial a^{[1]}}=\frac{\partial L}{\partial z^{[2]}}\cdot\frac{\partial z^{[2]}}{\partial a^{[1]}}=W^{[2]T}\cdot dz^{[2]}$
To be honest, I didn't fully understand the result of this step:
$\frac{\partial L}{\partial z^{[2]}}$ is $dz^{[2]}$ and $\frac{\partial z^{[2]}}{\partial a^{[1]}}$ is $W^{[2]T}$, so why is the product $W^{[2]T}\cdot dz^{[2]}$ rather than $dz^{[2]}\cdot W^{[2]T}$?
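
One way to see the order is to write the chain rule component by component. This is a sketch under two assumptions that are not restated in this section: the forward step is $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, and every gradient is stored as a column vector with the same shape as the variable it belongs to.

$$
z^{[2]}_j=\sum_k W^{[2]}_{jk}\,a^{[1]}_k+b^{[2]}_j
\quad\Rightarrow\quad
\frac{\partial L}{\partial a^{[1]}_k}
=\sum_j \frac{\partial L}{\partial z^{[2]}_j}\cdot\frac{\partial z^{[2]}_j}{\partial a^{[1]}_k}
=\sum_j W^{[2]}_{jk}\,dz^{[2]}_j
=\left(W^{[2]T}\,dz^{[2]}\right)_k
$$

The sum runs over the row index $j$ of $W^{[2]}$, and summing over the row index is exactly what multiplying by $W^{[2]T}$ on the left does. The shapes agree with this: if $W^{[2]}$ is $n_2 \times n_1$ and $dz^{[2]}$ is $n_2 \times 1$, then $W^{[2]T}dz^{[2]}$ is $n_1 \times 1$, the same shape as $a^{[1]}$, whereas $dz^{[2]}\cdot W^{[2]T}$ would multiply an $n_2 \times 1$ matrix by an $n_1 \times n_2$ one, which is not defined in general. Each row of $W^{[2]}$ plays the role of $a^T$ in the identity $\frac{\partial (a^T x)}{\partial x} = a$.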
$$dz^{[1]}=\frac{\partial L}{\partial z^{[1]}}=\frac{\partial L}{\partial a^{[1]}}\cdot\frac{\partial a^{[1]}}{\partial z^{[1]}}=da^{[1]}\cdot g^{[1]\prime}(z^{[1]})$$
$$dW^{[1]}=\frac{\partial L}{\partial W^{[1]}}=\frac{\partial L}{\partial z^{[1]}}\cdot\frac{\partial z^{[1]}}{\partial W^{[1]}}=dz^{[1]}\cdot a^{[0]T} \quad\Rightarrow\quad W^{[1]} \mathrel{-}= \alpha\cdot dW^{[1]}$$
$$db^{[1]}=\frac{\partial L}{\partial b^{[1]}}=\frac{\partial L}{\partial z^{[1]}}\cdot\frac{\partial z^{[1]}}{\partial b^{[1]}}=dz^{[1]} \quad\Rightarrow\quad b^{[1]} \mathrel{-}= \alpha\cdot db^{[1]}$$
Layer $l$:
Activation: $da^{[l]}=\frac{\partial L}{\partial a^{[l]}}=\frac{\partial L}{\partial z^{[l+1]}}\cdot\frac{\partial z^{[l+1]}}{\partial a^{[l]}}=W^{[l+1]T}\cdot dz^{[l+1]}$
$$dz^{[l]}=\frac{\partial L}{\partial z^{[l]}}=\frac{\partial L}{\partial a^{[l]}}\cdot\frac{\partial a^{[l]}}{\partial z^{[l]}}=da^{[l]}\cdot g^{[l]\prime}(z^{[l]}) \quad\Rightarrow\quad dz^{[l]}=W^{[l+1]T}dz^{[l+1]}\cdot g^{[l]\prime}(z^{[l]})$$
$$dW^{[l]}=\frac{\partial L}{\partial W^{[l]}}=\frac{\partial L}{\partial z^{[l]}}\cdot\frac{\partial z^{[l]}}{\partial W^{[l]}}=dz^{[l]}\cdot a^{[l-1]T} \quad\Rightarrow\quad W^{[l]} \mathrel{-}= \alpha\cdot dW^{[l]}$$
$$db^{[l]}=\frac{\partial L}{\partial b^{[l]}}=\frac{\partial L}{\partial z^{[l]}}\cdot\frac{\partial z^{[l]}}{\partial b^{[l]}}=dz^{[l]} \quad\Rightarrow\quad b^{[l]} \mathrel{-}= \alpha\cdot db^{[l]}$$
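
The layer-$l$ recurrences can be turned into a loop that walks from the last layer down to the first. The sketch below is illustrative only and rests on several assumptions: the forward step $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$, a `tanh` activation in every layer, the loss $L = \frac{1}{2}\lVert a^{[\text{last}]} - y\rVert^2$, a single training example, and hypothetical function names and layer widths. With a batch of examples, $db^{[l]}$ would be summed over the batch instead of being equal to $dz^{[l]}$.

```python
import numpy as np

def forward(params, x):
    """Return per-layer lists of z and a; a[0] is the input and z[0] is unused."""
    a, z = [x], [None]
    for W, b in params:
        z.append(W @ a[-1] + b)
        a.append(np.tanh(z[-1]))
    return z, a

def backward(params, z, a, y):
    """Apply the layer-l formulas from the last layer down to layer 1."""
    num_layers = len(params)
    grads = [None] * num_layers
    da = a[num_layers] - y                      # dL/da for the assumed loss
    for l in range(num_layers, 0, -1):
        W, _ = params[l - 1]
        dz = da * (1.0 - np.tanh(z[l]) ** 2)    # elementwise da * g'(z); tanh' = 1 - tanh^2
        dW = dz @ a[l - 1].T                    # same shape as W
        db = dz                                 # same shape as b
        grads[l - 1] = (dW, db)
        da = W.T @ dz                           # becomes da for layer l - 1
    return grads

def sgd_step(params, grads, alpha):
    """Return updated parameters: W - alpha * dW and b - alpha * db for each layer."""
    return [(W - alpha * dW, b - alpha * db)
            for (W, b), (dW, db) in zip(params, grads)]

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # hypothetical layer widths, input first
params = [(rng.standard_normal((n_out, n_in)) * 0.5, np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((sizes[0], 1))
y = rng.standard_normal((sizes[-1], 1))

z, a = forward(params, x)
grads = backward(params, z, a, y)
new_params = sgd_step(params, grads, alpha=0.1)
```

The line `da = W.T @ dz` is the only place where information moves between layers; everything else in the loop body uses quantities that belong to the current layer.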