• 【Li Hang Statistical Learning Notes】Chapter 6: Logistic Regression


    6.1 Logistic Regression

    The Logistic Distribution

    Recall the perceptron:

    f(x) = \operatorname{sign}(w \cdot x + b)

    Questions to consider:

    1. Isn't outputting only -1 and +1 too rigid? Is such a hard decision rule really effective?
    2. Is a point 0.001 to the left of the hyperplane really worlds apart from a point 0.001 to its right?

    Shortcomings of the perceptron:

    1. The perceptron updates its parameters by gradient descent, but the sign function has a discontinuity at x = 0 and is not differentiable there.
    2. Because sign is not continuously differentiable, the perceptron strips off the sign wrapper when doing gradient descent, so the function being optimized is not the function used for prediction.

    Definition of logistic regression:

    P(Y = 1 \mid x) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}, \quad P(Y = 0 \mid x) = \frac{1}{1 + \exp(w \cdot x)}
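As a quick sketch of the definition (the function name is mine, and the bias b is assumed to be folded into w via a constant feature), both posteriors come from the same score w·x and always sum to 1:

```python
import math

def posterior(w, x):
    """P(Y=1|x) and P(Y=0|x) under the logistic model above."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    # 1/(1+exp(-z)) is an algebraically equal, numerically stabler
    # form of exp(z)/(1+exp(z)) for positive z
    p1 = 1.0 / (1.0 + math.exp(-z))
    return p1, 1.0 - p1
```

A point at distance 0.001 from the hyperplane gets probability almost exactly 0.5 — precisely the soft behavior the questions above ask for.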

    Parameter estimation:

    When learning a logistic regression model from a given training set T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}, where x_i \in \mathbf{R}^n and y_i \in \{0, 1\}, the model parameters can be estimated by maximum likelihood, which yields the fitted logistic regression model.

    Let:
    P(Y = 1 \mid x) = \pi(x), \quad P(Y = 0 \mid x) = 1 - \pi(x)
    The likelihood function is:
    \prod_{i=1}^{N} \left[\pi(x_i)\right]^{y_i} \left[1 - \pi(x_i)\right]^{1 - y_i}
    The log-likelihood is:
    L(w) = \sum_{i=1}^{N} \left[y_i \log \pi(x_i) + (1 - y_i) \log(1 - \pi(x_i))\right] = \sum_{i=1}^{N} \left[y_i \log \frac{\pi(x_i)}{1 - \pi(x_i)} + \log(1 - \pi(x_i))\right] = \sum_{i=1}^{N} \left[y_i (w \cdot x_i) - \log(1 + \exp(w \cdot x_i))\right]
    Maximizing L(w) yields the estimate of w.

    Differentiating the log-likelihood with respect to w:
    L(w) = \sum_{i=1}^{N} \left[y_i (w \cdot x_i) - \log(1 + \exp(w \cdot x_i))\right]
    \frac{\partial L(w)}{\partial w} = \sum_{i=1}^{N} \left[y_i x_i - \frac{\exp(w \cdot x_i)}{1 + \exp(w \cdot x_i)} x_i\right] = \sum_{i=1}^{N} \left(y_i - \pi(x_i)\right) x_i
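The gradient above leads directly to a training loop. This is an illustrative sketch only (the function name and hyperparameters are my own, and plain full-batch gradient ascent is used — the book itself does not prescribe this particular loop):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Maximize L(w) = sum_i [ y_i (w.x_i) - log(1 + exp(w.x_i)) ] by gradient ascent."""
    X = np.hstack([X, np.ones((len(X), 1))])   # fold the bias into w as a constant feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        pi = 1.0 / (1.0 + np.exp(-X @ w))      # pi(x_i) = P(Y=1|x_i)
        w += lr * X.T @ (y - pi) / len(X)      # averaged gradient sum_i (y_i - pi(x_i)) x_i
    return w
```

Note how the update has the same shape as the perceptron's, but with the soft residual y_i - pi(x_i) in place of a hard misclassification indicator.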

    Summary:

    1. By outputting a probability, logistic regression removes the all-or-nothing gap between +1 and -1 for points at tiny distances from the hyperplane; the probability also serves as a confidence score for the prediction.
    2. Logistic regression makes the final model function continuously differentiable, so the training objective and the prediction function coincide.
    3. Logistic regression estimates its parameters by maximum likelihood.

    The Maximum Entropy Principle

    What is maximum entropy?

    When we guess probabilities, we treat whatever we are uncertain about as equally likely. With a die, we know there are six faces, so we assign each face probability 1/6.
    In other words, we lean toward the uniform distribution. Maximum entropy rests on exactly this plain idea: whatever we know, we build in; whatever we do not know, we spread uniformly.
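A small numeric check of this intuition (the helper function is mine, not from the text): among distributions over six outcomes, the uniform one has the largest entropy.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i, with the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

For a fair die H = log 6 ≈ 1.79 nats; any loaded die scores strictly lower.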

    The Maximum Entropy Model

    The target of learning:
    P(Y \mid X)
    Entropy:
    H(P) = -\sum_{x} P(x) \log P(x)
    Substituting the target into the entropy:
    H(P) = -\sum_{x, y} P(y \mid x) \log P(y \mid x)
    Weighting by the empirical marginal gives the conditional entropy actually maximized:
    H(P) = -\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)

    Constraints

    The feature function:
    f(x, y) = \begin{cases} 1, & x \text{ and } y \text{ satisfy a certain fact} \\ 0, & \text{otherwise} \end{cases}
    Expectation of the feature function f(x, y) with respect to the empirical joint distribution \tilde{P}(x, y):
    E_{\tilde{P}}(f) = \sum_{x, y} \tilde{P}(x, y) f(x, y) = \sum_{x, y} \tilde{P}(x) \tilde{P}(y \mid x) f(x, y)
    Expectation of f(x, y) with respect to the model P(y \mid x) together with the empirical marginal \tilde{P}(x):
    E_{P}(f) = \sum_{x, y} \tilde{P}(x) P(y \mid x) f(x, y)
    Constraint (the model must reproduce the empirical feature expectations):
    E_{\tilde{P}}(f) = E_{P}(f)
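The empirical expectation E_{\tilde{P}}(f) is easy to compute from raw (x, y) pairs; a minimal sketch (the sample data and names are hypothetical, not from the book):

```python
from collections import Counter

def empirical_expectation(samples, f):
    """E_{P~}(f) = sum_{x,y} P~(x,y) f(x,y), with P~ taken from sample counts."""
    counts = Counter(samples)          # (x, y) -> count
    n = len(samples)
    return sum((c / n) * f(x, y) for (x, y), c in counts.items())
```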

    The constrained optimization problem is therefore:

    \max_{P \in C} \quad H(P) = -\sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)
    \text{s.t.} \quad E_{\tilde{P}}(f_i) - E_{P}(f_i) = 0, \quad \sum_{y} P(y \mid x) = 1

    or, equivalently, as a minimization:

    \min_{P \in C} \quad -H(P) = \sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x)
    \text{s.t.} \quad E_{\tilde{P}}(f_i) - E_{P}(f_i) = 0, \quad \sum_{y} P(y \mid x) = 1

    The Method of Lagrange Multipliers

    L(P, w) \equiv -H(P) + w_0 \left(1 - \sum_{y} P(y \mid x)\right) + \sum_{i=1}^{n} w_i \left(E_{\tilde{P}}(f_i) - E_{P}(f_i)\right)
    = \sum_{x, y} \tilde{P}(x) P(y \mid x) \log P(y \mid x) + w_0 \left(1 - \sum_{y} P(y \mid x)\right) + \sum_{i=1}^{n} w_i \left(\sum_{x, y} \tilde{P}(x, y) f_i(x, y) - \sum_{x, y} \tilde{P}(x) P(y \mid x) f_i(x, y)\right)

    By duality, the primal min-max problem is converted to the dual max-min problem:
    \min_{P \in C} \max_{w} L(P, w) \rightarrow \max_{w} \min_{P \in C} L(P, w)

    Solving the inner minimization over P gives the maximum entropy model:
    P_w(y \mid x) = \frac{1}{Z_w(x)} \exp \left(\sum_{i=1}^{n} w_i f_i(x, y)\right), \quad Z_w(x) = \sum_{y} \exp \left(\sum_{i=1}^{n} w_i f_i(x, y)\right)
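The solved form can be evaluated directly; a sketch with binary indicator features (the function names and the toy feature in the test are my own, chosen for illustration):

```python
import math

def maxent_conditional(weights, features, x, labels):
    """P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x), normalized over the labels."""
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
              for y in labels}
    z = sum(scores.values())           # Z_w(x)
    return {y: s / z for y, s in scores.items()}
```

With all weights zero the model is uniform over the labels, which matches the maximum entropy intuition: no constraints learned, so everything is equally likely.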

    Summary

    1. Maximum entropy refuses to add assumptions beyond the data, taking maximal entropy as the objective.
    2. Substitute the target distribution P(y \mid x) into the entropy formula and maximize it.
    3. Extract constraints from the training set by computing feature expectations and matching them. The method of Lagrange multipliers yields the form of P(y \mid x), and an optimization algorithm then finds its parameters w.

    6.2 Improved Iterative Scaling (IIS)

    The model to be fitted is known:
    P_w(y \mid x) = \frac{1}{Z_w(x)} \exp \left(\sum_{i=1}^{n} w_i f_i(x, y)\right), \quad Z_w(x) = \sum_{y} \exp \left(\sum_{i=1}^{n} w_i f_i(x, y)\right)
    Taking the product over the data and then the logarithm converts this into the log-likelihood:
    L(w) = \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} w_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log Z_w(x)
    Core idea of IIS: at each step find an increment \delta such that L(w + \delta) > L(w), and keep raising L this way until it reaches its maximum.
    L(w + \delta) - L(w) = \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log \frac{Z_{w+\delta}(x)}{Z_w(x)}
    where
    \frac{Z_{w+\delta}(x)}{Z_w(x)} = \frac{1}{Z_w(x)} \sum_{y} \exp \left(\sum_{i=1}^{n} (w_i + \delta_i) f_i(x, y)\right) = \sum_{y} \frac{1}{Z_w(x)} \exp \left(\sum_{i=1}^{n} w_i f_i(x, y)\right) \exp \left(\sum_{i=1}^{n} \delta_i f_i(x, y)\right) = \sum_{y} P_w(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_i f_i(x, y)\right)

    Hence, applying the inequality -\log \alpha \ge 1 - \alpha (valid for \alpha > 0):
    L(w + \delta) - L(w) = \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) - \sum_{x} \tilde{P}(x) \log \frac{Z_{w+\delta}(x)}{Z_w(x)} \ge \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_i f_i(x, y)\right)
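This bound relies on -\log \alpha \ge 1 - \alpha for \alpha > 0 (the tangent of -\log at \alpha = 1 lies below it, with equality only there); a quick numeric sanity check, purely illustrative:

```python
import math

# -log(a) >= 1 - a for all a > 0, with equality only at a = 1
for a in (0.1, 0.5, 1.0, 2.0, 10.0):
    assert -math.log(a) >= (1 - a) - 1e-12
```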


    With f^*(x, y) = \sum_{i=1}^{n} f_i(x, y), the ratios f_i(x, y) / f^*(x, y) sum to 1 over i, so Jensen's inequality applied to the convex function \exp gives:
    \exp \left(\sum_{i=1}^{n} \delta_i f_i(x, y)\right) = \exp \left(\sum_{i=1}^{n} \frac{f_i(x, y)}{f^*(x, y)} \delta_i f^*(x, y)\right) \le \sum_{i=1}^{n} \frac{f_i(x, y)}{f^*(x, y)} \exp \left(\delta_i f^*(x, y)\right)

    Combining the two bounds:
    L(w + \delta) - L(w) \ge \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_i f_i(x, y)\right) \ge \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \sum_{i=1}^{n} \frac{f_i(x, y)}{f^*(x, y)} \exp \left(\delta_i f^*(x, y)\right)
    Define the two lower bounds:
    A(\delta \mid w) = \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \exp \left(\sum_{i=1}^{n} \delta_i f_i(x, y)\right)
    B(\delta \mid w) = \sum_{x, y} \tilde{P}(x, y) \sum_{i=1}^{n} \delta_i f_i(x, y) + 1 - \sum_{x} \tilde{P}(x) \sum_{y} P_w(y \mid x) \sum_{i=1}^{n} \frac{f_i(x, y)}{f^*(x, y)} \exp \left(\delta_i f^*(x, y)\right)

    Taking \delta = 0 gives
    A(0 \mid w) = 0, \quad B(0 \mid w) = 0
    so both bounds are tight at \delta = 0.

    Setting the partial derivative of B(\delta \mid w) with respect to each \delta_i to zero gives an equation in \delta_i alone:
    g(\delta_i) = \sum_{x, y} \tilde{P}(x) P_w(y \mid x) f_i(x, y) \exp \left(\delta_i f^*(x, y)\right) - E_{\tilde{P}}(f_i) = 0
    which can be solved for each \delta_i by Newton's method:
    \delta_i^{(k+1)} = \delta_i^{(k)} - \frac{g(\delta_i^{(k)})}{g'(\delta_i^{(k)})}
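The per-coordinate Newton iteration can be sketched generically. Here I solve a toy one-feature analogue of g(\delta) = 0, namely a \exp(\delta M) - E = 0 with known root \log(E/a)/M; the solver, constants, and toy equation are all invented for illustration, not taken from the book:

```python
import math

def newton_solve(g, g_prime, delta0=0.0, tol=1e-10, max_iter=100):
    """Find a root of g by the update delta <- delta - g(delta) / g'(delta)."""
    delta = delta0
    for _ in range(max_iter):
        step = g(delta) / g_prime(delta)
        delta -= step
        if abs(step) < tol:
            break
    return delta

# Toy analogue of g(delta_i): a * exp(delta * M) - E, root at log(E / a) / M.
a, M, E = 0.3, 2.0, 0.9
root = newton_solve(lambda d: a * math.exp(d * M) - E,
                    lambda d: a * M * math.exp(d * M))
```

Starting from delta0 = 0 (where, as shown above, the bound vanishes) the iteration converges in a handful of steps for this convex g.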

    Summary:

    IIS finds a lower bound on the improvement of the objective and raises the objective by repeatedly maximizing this lower bound.

  • Original article: https://blog.csdn.net/weixin_39236489/article/details/126093399