$$f(x)=\operatorname{sign}(w \cdot x+b)$$
Thoughts:
A drawback of the perceptron: $\operatorname{sign}$ is a hard, non-differentiable threshold, so the model outputs only a class label with no probability attached. Logistic regression instead models the conditional probability directly:
$$P(Y=1 \mid x)=\frac{\exp (w \cdot x)}{1+\exp (w \cdot x)}, \qquad P(Y=0 \mid x)=\frac{1}{1+\exp (w \cdot x)}$$
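As an illustration (not from the original notes), a minimal NumPy sketch of these two probabilities; the bias $b$ is assumed to be absorbed into $w$ through a constant feature, matching the $w \cdot x$ form above:

```python
import numpy as np

def logistic_prob(w, x):
    """Return (P(Y=1|x), P(Y=0|x)) for logistic regression.

    exp(w.x) / (1 + exp(w.x)) equals the sigmoid of w.x, so computing
    the sigmoid directly avoids overflow for large positive w.x.
    """
    z = np.dot(w, x)
    p1 = 1.0 / (1.0 + np.exp(-z))
    return p1, 1.0 - p1
```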
Parameter estimation:
When learning a logistic regression model, for a given training set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i} \in \mathbf{R}^{n}$ and $y_{i} \in\{0,1\}$, the model parameters can be estimated by maximum likelihood, which yields the fitted logistic regression model.
Let:
$$P(Y=1 \mid x)=\pi(x), \quad P(Y=0 \mid x)=1-\pi(x)$$
The likelihood function is:
$$\prod_{i=1}^{N}\left[\pi\left(x_{i}\right)\right]^{y_{i}}\left[1-\pi\left(x_{i}\right)\right]^{1-y_{i}}$$
The log-likelihood function is:
$$\begin{aligned} L(w) &=\sum_{i=1}^{N}\left[y_{i} \log \pi\left(x_{i}\right)+\left(1-y_{i}\right) \log \left(1-\pi\left(x_{i}\right)\right)\right] \\ &=\sum_{i=1}^{N}\left[y_{i} \log \frac{\pi\left(x_{i}\right)}{1-\pi\left(x_{i}\right)}+\log \left(1-\pi\left(x_{i}\right)\right)\right] \\ &=\sum_{i=1}^{N}\left[y_{i}\left(w \cdot x_{i}\right)-\log \left(1+\exp \left(w \cdot x_{i}\right)\right)\right] \end{aligned}$$
Maximizing $L(w)$ gives the estimate of $w$.
Differentiating the log-likelihood with respect to $w$:
$$L(w)=\sum_{i=1}^{N}\left[y_{i}\left(w \cdot x_{i}\right)-\log \left(1+\exp \left(w \cdot x_{i}\right)\right)\right]$$
$$\frac{\partial L(w)}{\partial w}=\sum_{i=1}^{N}\left[y_{i} \cdot x_{i}-\frac{\exp \left(w \cdot x_{i}\right)}{1+\exp \left(w \cdot x_{i}\right)} \cdot x_{i}\right]=\sum_{i=1}^{N}\left[y_{i}-\pi\left(x_{i}\right)\right] x_{i}$$
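Since the gradient has this closed form, the maximum likelihood estimate can be found by plain gradient ascent. A minimal sketch, assuming `X` is an $N \times n$ NumPy array and `y` a 0/1 label vector; the step size and iteration count are illustrative choices, not prescribed by the text:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Maximize L(w) = sum_i [y_i (w.x_i) - log(1 + exp(w.x_i))] by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ w))  # pi(x_i) = P(Y=1 | x_i)
        w += lr * (X.T @ (y - pi))         # dL/dw = sum_i (y_i - pi(x_i)) x_i
    return w
```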
When we guess probabilities, we treat whatever is uncertain as equally likely, just like a die: we know it has six faces, so we take each face to have probability $1/6$. In other words, we tend toward the uniform distribution. Maximum entropy is built on exactly this plain principle: whatever we know, we take into account; whatever we do not know, we spread uniformly.
The ultimate goal:
$$P(Y \mid X)$$
Entropy:
$$H(P)=-\sum_{x} P(x) \log P(x)$$
Substituting the target distribution into the entropy:
$$H(P)=-\sum_{x, y} P(y \mid x) \log P(y \mid x)$$
Making one adjustment, weight each $x$ by its empirical marginal $\tilde{P}(x)$:
$$H(P)=-\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)$$
Feature function:
$$f(x, y)=\begin{cases}1, & x \text{ and } y \text{ satisfy a certain fact} \\ 0, & \text{otherwise}\end{cases}$$
The expectation of the feature function $f(x, y)$ with respect to the empirical distribution $\tilde{P}(x, y)$:
$$E_{\tilde{P}}(f)=\sum_{x, y} \tilde{P}(x, y) f(x, y)=\sum_{x, y} \tilde{P}(x) \tilde{P}(y \mid x) f(x, y)$$
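To make the formula concrete, a sketch that estimates $\tilde{P}(x, y)$ by counting over a toy sample; the sample and the feature function `f` are hypothetical:

```python
from collections import Counter

def empirical_expectation(samples, f):
    """E_P~(f) = sum_{x,y} P~(x,y) f(x,y), with P~(x,y) = count(x,y) / N."""
    N = len(samples)
    counts = Counter(samples)  # joint counts of (x, y) pairs
    return sum(c / N * f(x, y) for (x, y), c in counts.items())

# Toy data: f fires when x and y "satisfy a certain fact" (here: x == 'a' and y == 1)
samples = [('a', 1), ('a', 1), ('a', 0), ('b', 1)]
f = lambda x, y: 1 if (x == 'a' and y == 1) else 0
print(empirical_expectation(samples, f))  # 2/4 = 0.5
```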
The expectation of the feature function $f(x, y)$ with respect to the model $P(y \mid x)$ and the empirical marginal $\tilde{P}(x)$:
$$E_{P}(f)=\sum_{x, y} P(x, y) f(x, y)=\sum_{x, y} \tilde{P}(x) P(y \mid x) f(x, y)$$
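Similarly, a sketch of $E_{P}(f)$, assuming a hypothetical `model_prob(x, y)` that returns the model's $P(y \mid x)$ and a finite label set `ys`:

```python
from collections import Counter

def model_expectation(samples, ys, model_prob, f):
    """E_P(f) = sum_{x,y} P~(x) P(y|x) f(x,y), with P~(x) estimated by counting."""
    N = len(samples)
    marg = Counter(x for x, _ in samples)  # empirical marginal P~(x)
    return sum(c / N * sum(model_prob(x, y) * f(x, y) for y in ys)
               for x, c in marg.items())
```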
Constraint:
$$E_{\tilde{P}}(f)=E_{P}(f)$$
The learning problem is then the constrained optimization
$$\begin{aligned} \max _{P \in C} \quad & H(P)=-\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x) \\ \text { s.t. } \quad & E_{\tilde{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)=0, \quad i=1,2, \cdots, n \\ & \sum_{y} P(y \mid x)=1 \end{aligned}$$
which, following convention, is rewritten as the equivalent minimization
$$\begin{aligned} \min _{P \in C} \quad & -H(P)=\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x) \\ \text { s.t. } \quad & E_{\tilde{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)=0, \quad i=1,2, \cdots, n \\ & \sum_{y} P(y \mid x)=1 \end{aligned}$$
Introducing Lagrange multipliers $w_{0}, w_{1}, \cdots, w_{n}$, define the Lagrangian:
$$\begin{aligned} L(P, w) \equiv &-H(P)+w_{0}\left(1-\sum_{y} P(y \mid x)\right)+\sum_{i=1}^{n} w_{i}\left(E_{\tilde{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)\right) \\ =& \sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)+w_{0}\left(1-\sum_{y} P(y \mid x)\right) \\ &+\sum_{i=1}^{n} w_{i}\left(\sum_{x, y} \tilde{P}(x, y) f_{i}(x, y)-\sum_{x, y} \tilde{P}(x) P(y \mid x) f_{i}(x, y)\right) \end{aligned}$$
$$\min _{P \in C} \max _{w} L(P, w) \rightarrow \max _{w} \min _{P \in C} L(P, w)$$
Solving the inner minimization $\min _{P \in C} L(P, w)$ yields the maximum entropy model:
$$P_{w}(y \mid x)=\frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right), \qquad Z_{w}(x)=\sum_{y} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)$$
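A sketch of evaluating this model over a finite label set; `w`, `ys`, and the list `features` of 0/1 feature functions are assumed names for illustration:

```python
import numpy as np

def maxent_prob(w, x, ys, features):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x) over a finite label set ys."""
    scores = np.array([sum(w[i] * f(x, y) for i, f in enumerate(features))
                       for y in ys])
    expz = np.exp(scores - scores.max())     # shift by max for numerical stability
    return dict(zip(ys, expz / expz.sum()))  # division by expz.sum() realizes 1/Z_w(x)
```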
The target to be solved is known:
$$P_{w}(y \mid x)=\frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right), \qquad Z_{w}(x)=\sum_{y} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)$$
Multiplying these expressions over all samples and taking the logarithm gives the log-likelihood:
$$L(w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} w_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln Z_{w}(x)\right]$$
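Continuing with the same assumed names, the log-likelihood can be evaluated directly from its two sums, with the empirical distributions estimated by counting:

```python
import math
from collections import Counter

def maxent_loglik(samples, w, ys, features):
    """L(w) = sum_{x,y} P~(x,y) sum_i w_i f_i(x,y) - sum_x P~(x) ln Z_w(x)."""
    N = len(samples)
    joint = Counter(samples)               # empirical joint P~(x, y)
    marg = Counter(x for x, _ in samples)  # empirical marginal P~(x)

    def score(x, y):                       # sum_i w_i f_i(x, y)
        return sum(w[i] * f(x, y) for i, f in enumerate(features))

    first = sum(c / N * score(x, y) for (x, y), c in joint.items())
    second = sum(c / N * math.log(sum(math.exp(score(x, y)) for y in ys))
                 for x, c in marg.items())
    return first - second
```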
Core idea of IIS (improved iterative scaling): at each step add an increment $\delta$ such that $L(w+\delta)>L(w)$, thereby raising the value of $L$ repeatedly until it reaches its maximum.
$$L(w+\delta)-L(w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_{w}(x)}\right]$$
where
$$\begin{aligned} \frac{Z_{w+\delta}(x)}{Z_{w}(x)} &=\frac{1}{Z_{w}(x)} \sum_{y} \exp \left(\sum_{i=1}^{n}\left(w_{i}+\delta_{i}\right) f_{i}(x, y)\right) \\ &=\sum_{y} \frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \\ &=\sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \end{aligned}$$
Therefore, applying $-\ln \alpha \geq 1-\alpha$ (valid for $\alpha>0$) to the second term:
$$\begin{aligned} L(w+\delta)-L(w) &=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_{w}(x)}\right] \\ & \geq \sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \end{aligned}$$
Moreover, letting $f^{*}(x, y)=\sum_{i=1}^{n} f_{i}(x, y)$, the ratios $f_{i}(x, y) / f^{*}(x, y)$ are nonnegative and sum to 1, so Jensen's inequality applied to the convex function $\exp$ gives:
$$\exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right)=\exp \left(\sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} f^{*}(x, y) \delta_{i}\right) \leq \sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} \exp \left(\delta_{i} f^{*}(x, y)\right)$$
Therefore:
$$\begin{aligned} L(w+\delta)-L(w) &=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]-\sum_{x}\left[\tilde{P}(x) \ln \frac{Z_{w+\delta}(x)}{Z_{w}(x)}\right] \\ & \geq \sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right) \\ & \geq \sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} \exp \left(\delta_{i} f^{*}(x, y)\right) \end{aligned}$$
Let:
$$A(\delta \mid w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right)$$
$$B(\delta \mid w)=\sum_{x, y}\left[\tilde{P}(x, y) \sum_{i=1}^{n} \delta_{i} f_{i}(x, y)\right]+1-\sum_{x} \tilde{P}(x) \sum_{y} P_{w}(y \mid x) \sum_{i=1}^{n} \frac{f_{i}(x, y)}{f^{*}(x, y)} \exp \left(\delta_{i} f^{*}(x, y)\right)$$
When $\delta=0$, we have $A(\delta \mid w)=0$ and $B(\delta \mid w)=0$, so any $\delta$ with $B(\delta \mid w)>0$ strictly increases $L$.
Therefore, maximizing the lower bound $B(\delta \mid w)$ by setting $\partial B / \partial \delta_{i}=0$ leads to:
$$g\left(\delta_{i}\right)=\sum_{x, y} \tilde{P}(x) P_{w}(y \mid x) f_{i}(x, y) \exp \left(\delta_{i} f^{*}(x, y)\right)-E_{\tilde{P}}\left(f_{i}\right)$$
$$g\left(\delta_{i}\right)=0$$
Solve this with Newton's method:
$$\delta_{i}^{(k+1)}=\delta_{i}^{(k)}-\frac{g\left(\delta_{i}^{(k)}\right)}{g^{\prime}\left(\delta_{i}^{(k)}\right)}$$
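A sketch of this one-dimensional Newton iteration; the coefficients $a_{k}=\tilde{P}(x) P_{w}(y \mid x) f_{i}(x, y)$ and exponents $s_{k}=f^{*}(x, y)$ are assumed to be precomputed and passed in as pairs, and $E_{\tilde{P}}(f_{i})$ as `e_fi`:

```python
import math

def newton_delta(terms, e_fi, delta0=0.0, n_iter=50, tol=1e-10):
    """Solve g(delta) = 0 by Newton's method, where
    g(delta) = sum_k a_k * exp(delta * s_k) - e_fi."""
    delta = delta0
    for _ in range(n_iter):
        g = sum(a * math.exp(delta * s) for a, s in terms) - e_fi
        g_prime = sum(a * s * math.exp(delta * s) for a, s in terms)
        step = g / g_prime  # Newton step: g(delta) / g'(delta)
        delta -= step       # delta^(k+1) = delta^(k) - g/g'
        if abs(step) < tol:
            break
    return delta
```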
Summary:
IIS finds a lower bound on the original optimization objective and raises the objective by repeatedly raising this lower bound.