Univariate feature selection tests each feature individually, measures the relationship between that feature and the response variable, and drops the features that score poorly. This approach is simple, easy to run, and easy to interpret.
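As a minimal sketch of this workflow (on a synthetic dataset; the variable names and the choice of `SelectKBest` with `f_regression` are just one possible setup, not the only one), scikit-learn can score every feature with a univariate test and keep only the top-k:

- from sklearn.datasets import make_regression
- from sklearn.feature_selection import SelectKBest, f_regression
-
- # synthetic regression data: 100 samples, 10 features, only 3 informative
- X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)
- selector = SelectKBest(score_func=f_regression, k=3)
- X_new = selector.fit_transform(X, y)        # keep the 3 highest-scoring features
- print(selector.scores_)                     # univariate F-scores, one per feature
- print(selector.get_support(indices=True))   # indices of the selected features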
One of the simplest univariate measures is the Pearson correlation coefficient, which quantifies the linear correlation between two variables. Its value lies in [-1, 1]: -1 means a perfect negative correlation, +1 a perfect positive correlation, and 0 no linear correlation. It is defined as the covariance of the two variables divided by the product of their standard deviations.
The Pearson coefficient is fast and easy to compute, so it can be run as a first check immediately after data cleaning and feature extraction.
![\rho _{X,Y} = \frac{E[(X-\mu _{X})(Y-\mu_{Y})]}{\sigma _{X}\sigma _{Y}}](https://1000bd.com/contentImg/2023/11/08/070727421.png)
- import numpy as np
- from scipy.stats import pearsonr
- np.random.seed(0)
- size = 300
- x = np.random.normal(0,1,size)
- print("lower noise",pearsonr(x,x + np.random.normal(0,1,size)))
- print("higher noise",pearsonr(x,x + np.random.normal(0,10,size)))
- # lower noise (0.7182483686213841, 7.324017312998504e-49)
- # higher noise (0.05796429207933814, 0.31700993885325246)
When the noise is small, the correlation is strong and the p-value is very low; with more noise, the correlation collapses. The main drawback of the Pearson coefficient is that it is only sensitive to linear relationships.
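To make this weakness concrete (a small illustrative check, independent of the code above), take y = x² with x symmetric around 0: y is a perfect function of x, yet the Pearson coefficient is close to 0:

- import numpy as np
- from scipy.stats import pearsonr
-
- x = np.random.uniform(-1, 1, 10000)
- # y = x**2 depends on x completely, but the *linear* correlation is ~0
- r, p = pearsonr(x, x**2)
- print(r)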
The distance correlation coefficient overcomes this weakness of the Pearson coefficient. It measures dependence between variables through the distance covariance, and it is zero only when the variables are independent; another advantage is that the two variables do not have to be of the same dimension. The reported distance correlation is usually the square root of the normalized distance covariance.

- def dist(x, y):
-     # matrix of pairwise absolute differences between the samples of x and y
-     return np.abs(x[:, None] - y)
-
- def d_n(x):
-     # double-centered distance matrix
-     d = dist(x, x)
-     dn = d - d.mean(0) - d.mean(1)[:, None] + d.mean()
-     return dn
-
- def dcov_all(x, y):
-     # distance correlation: normalized distance covariance, then the square root
-     dnx = d_n(x)
-     dny = d_n(y)
-
-     denom = np.prod(dnx.shape)
-     dc = (dnx * dny).sum() / denom
-     dvx = (dnx ** 2).sum() / denom
-     dvy = (dny ** 2).sum() / denom
-     dr = dc / (np.sqrt(dvx) * np.sqrt(dvy))
-     return np.sqrt(dr)
-
- x = np.random.uniform(-1, 1, 10000)
- dc = dcov_all(x, x**2)
- print(dc)
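Regularized linear models can also be used for feature selection. L1 regularization (the Lasso) adds the L1 norm of the coefficient vector to the loss function; because this penalty drives many coefficients exactly to zero, the features that keep non-zero coefficients form a sparse selection. The following example fits a Lasso model on the standardized Boston housing data: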
- from sklearn.linear_model import Lasso
- from sklearn.preprocessing import StandardScaler
- from sklearn.datasets import load_boston
- data = load_boston()
- # standardize the features so the Lasso coefficients are comparable across features
- scaler = StandardScaler()
- X = scaler.fit_transform(data.data)
- Y = data.target
- names = data.feature_names
-
- lasso = Lasso(alpha = 0.3)
- lasso.fit(X,Y)
- print('Lasso model',lasso.coef_,names)
-
- '''Lasso model [-0.24227912 0.081819 -0. 0.53987192 -0.69891258 2.99322993
- -0. -1.08091325 0. -0. -1.75561249 0.62831526 -3.70463287]
- ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
- 'B' 'LSTAT']'''
As the output shows, many of the coefficients are exactly 0. If alpha is increased further, the model becomes sparser and sparser, i.e. more and more coefficients shrink to 0. However, L1-regularized models are, like unregularized linear models, unstable: when the feature set contains correlated features, small changes in the data can lead to very different models.
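As a small illustration of this instability (a sketch in the same spirit as the correlated-feature experiment below; the seeds and the alpha value are arbitrary), fitting a Lasso on three nearly identical features with slightly different noise realizations can distribute the weight across the features quite differently each time:

- import numpy as np
- from sklearn.linear_model import Lasso
-
- size = 100
- for seed in (0, 1):
-     np.random.seed(seed)
-     X_seed = np.random.normal(0, 1, size)
-     # three strongly correlated copies of the same underlying signal
-     X = np.array([X_seed + np.random.normal(0, .1, size) for _ in range(3)]).T
-     Y = X.sum(axis=1) + np.random.normal(0, 1, size)
-     lasso = Lasso(alpha=0.3).fit(X, Y)
-     print('seed %d, Lasso coefficients:' % seed, lasso.coef_)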
Ridge regression adds the L2 norm of the coefficient vector to the loss function. Because the L2 penalty term is quadratic in the coefficients, L2 regularization, compared with L1, pushes the coefficient values to be more evenly spread out; for correlated features this means they receive similar coefficients. For feature selection, L2 regularization gives a stable model: unlike with L1 regularization, the coefficients do not swing with small changes in the data. L2 and L1 regularization therefore offer different kinds of value, and L2 regularization is more useful for understanding the features: a feature with strong explanatory power keeps a non-zero coefficient.
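For reference, the two penalized objectives being compared are, in their standard textbook form (α is the regularization strength, corresponding to the `alpha` parameter of the scikit-learn estimators, which scale the squared-error term slightly differently):

$$\min_{w}\ \lVert Xw - y\rVert_2^2 + \alpha \lVert w\rVert_1 \qquad \text{(Lasso, L1)}$$

$$\min_{w}\ \lVert Xw - y\rVert_2^2 + \alpha \lVert w\rVert_2^2 \qquad \text{(Ridge, L2)}$$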
- # Example with 3 mutually correlated features, refit 10 times with 10 different random seeds,
- # to compare the stability of an unregularized linear model and of L2 (ridge) regularization.
- from sklearn.linear_model import Ridge
- from sklearn.linear_model import LinearRegression
- from sklearn.metrics import r2_score
-
- size = 100
- for i in range(10):
-     print("Random seed %s" % i)
-     np.random.seed(seed=i)
-     X_seed = np.random.normal(0, 1, size)
-     X1 = X_seed + np.random.normal(0, .1, size)
-     X2 = X_seed + np.random.normal(0, .1, size)
-     X3 = X_seed + np.random.normal(0, .1, size)
-     Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
-     X = np.array([X1, X2, X3]).T
-
-     lr = LinearRegression()
-     lr.fit(X, Y)
-     print('Linear model', lr.coef_)
-
-     ridge = Ridge(alpha=10)
-     ridge.fit(X, Y)
-     print('Ridge model', ridge.coef_)
-
- '''Random seed 0
- Linear model [ 0.7284403 2.30926001 -0.08219169]
- Ridge model [0.93832131 1.05887277 0.87652644]
- i.e.:
- Linear model [ 0.7284403*X0 + 2.30926001*X1 -0.08219169*X2]
- Ridge model [0.93832131*X0 + 1.05887277*X1 + 0.87652644*X2]
- Random seed 1
- Linear model [ 1.15181561 2.36579916 -0.59900864]
- Ridge model [0.98409577 1.06792673 0.75855367]
- Random seed 2
- Linear model [0.69734749 0.32155864 2.08590886]
- Ridge model [0.97159124 0.94256202 1.08539406]
- Random seed 3
- Linear model [0.28735446 1.25386129 1.49054726]
- Ridge model [0.91891806 1.00474386 1.03276594]
- Random seed 4
- Linear model [0.18726691 0.77214206 2.1894915 ]
- Ridge model [0.96401621 0.98152524 1.0983599 ]
- Random seed 5
- Linear model [-1.2912413 1.59097473 2.74727029]
- Ridge model [0.75819864 1.01085804 1.1390417 ]
- Random seed 6
- Linear model [ 1.19909595 -0.0306915 1.91454912]
- Ridge model [1.01616507 0.89032238 1.0907386 ]
- Random seed 7
- Linear model [ 1.47386769 1.76236014 -0.15057274]
- Ridge model [1.0179376 1.03865514 0.90082373]
- Random seed 8
- Linear model [0.0840547 1.87985845 1.10688887]
- Ridge model [0.90685834 1.07119752 1.00837994]
- Random seed 9
- Linear model [0.71408648 0.77601368 1.36406398]
- Ridge model [0.89617178 0.90340866 0.98015958]'''
As the output shows, the coefficients of plain linear regression fluctuate widely across the datasets, while the coefficients of the L2-regularized model are very stable: they differ little and all stay close to 1, reflecting the underlying structure of the data.
Random forests are accurate, robust, and easy to use, which has made them one of the most popular machine learning algorithms.
Random forests offer two feature selection methods: mean decrease accuracy and mean decrease impurity (a mean decrease impurity sketch is given right after the example below).
The following example uses sklearn's random forest regressor on the Boston housing data for univariate selection:
- from sklearn.datasets import load_boston
- from sklearn.model_selection import cross_val_score,ShuffleSplit
- from sklearn.ensemble import RandomForestRegressor as RFR
- import numpy as np
-
- data = load_boston()
- X = data.data
- y = data.target
- names = data.feature_names
- rf = RFR(n_estimators = 20,max_depth = 4)
- scores = []
- for i in range(X.shape[1]):
-     # evaluate each feature on its own; ShuffleSplit draws 3 random train/test splits
-     # with 30% of the data held out, and round() keeps 3 decimals of the mean R^2
-     score = cross_val_score(rf, X[:, i:i+1], y, scoring='r2',
-                             cv=ShuffleSplit(n_splits=3, test_size=.3))
-     scores.append((round(np.mean(score), 3), names[i]))
- print(sorted(scores, reverse=True))
-
- '''[(-2.402, 'CHAS'), (-3.584, 'PTRATIO'), (-3.701, 'INDUS'), (-3.751, 'NOX'), (-4.113, 'LSTAT'), (-4.661, 'RAD'), (-4.807, 'DIS'), (-5.696, 'ZN'), (-5.836, 'TAX'), (-5.921, 'RM'), (-9.324, 'B'), (-24.344, 'CRIM'), (-34.624, 'AGE')]'''
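The mean decrease impurity ranking mentioned above can be read directly from a fitted forest's `feature_importances_` attribute (a brief sketch that reuses `X`, `y`, `names`, and the `RFR` import from the example above; the hyperparameters are arbitrary):

- rf = RFR(n_estimators=100, random_state=0)
- rf.fit(X, y)
- # feature_importances_ holds the normalized mean decrease in impurity per feature
- print(sorted(zip(map(lambda v: round(v, 3), rf.feature_importances_), names), reverse=True))

The other method, mean decrease accuracy, measures how much the model's score drops when a feature's values are randomly permuted; in scikit-learn this is available as `sklearn.inspection.permutation_importance`.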
Feature selection with XGBoost: XGBoost is widely used in industry, and the importance of a feature (its feature score) equals the number of times it is chosen as a split feature across all trees. For example, if feature A is chosen once to split a node in the first iteration (the first tree) and twice in the second iteration, its final feature score is 1 + 2 = 3. These feature scores can be used to select features. The XGBoost feature selection code is as follows:
Dataset: Allstate Claims Severity | Kaggle
- import numpy as np
- import pandas as pd
- import xgboost as xgb
- import operator
- import matplotlib.pyplot as plt
-
- def create_feature_map(features):
-     # write a feature map file (index, name, type) that get_fscore can consume
-     outfile = open('xgb.fmap', 'w')
-     i = 0
-     for feat in features:
-         outfile.write('{0}\t{1}\tq\n'.format(i, feat))
-         i = i + 1
-     outfile.close()
-
- if __name__ == '__main__':
-     train = pd.read_csv('Allstate Claims Severity/train.csv')
-
-     cat_sel = [n for n in train.columns if n.startswith('cat')]
-     # encode the categorical features as integers
-
-     for column in cat_sel:
-         train[column] = pd.factorize(train[column].values, sort=True)[0] + 1
-
-     params = {
-         'min_child_weight': 100,
-         'eta': 0.02,
-         'colsample_bytree': 0.7,
-         'max_depth': 12,
-         'subsample': 0.7,
-         'alpha': 1,
-         'gamma': 1,
-         'silent': 1,
-         'verbose_eval': True,
-         'seed': 12
-     }
-
-     rounds = 10
-     y = train['loss']
-     X = train.drop(['loss', 'id'], axis=1)
-
-     # build the xgb training set and train the model
-     xgtrain = xgb.DMatrix(X, label=y)
-     bst = xgb.train(params, xgtrain, num_boost_round=rounds)
-
-
-     features = [x for x in train.columns if x not in ['id', 'loss']]
-     create_feature_map(features)
-
-     importance = bst.get_fscore(fmap='xgb.fmap')
-     # sort by the score stored in the importance dict, using key=operator.itemgetter(1)
-     importance = sorted(importance.items(), key=operator.itemgetter(1))
-
-     # build a data frame of feature scores
-     df = pd.DataFrame(importance, columns=['feature', 'fscore'])
-     df['fscore'] = df['fscore'] / df['fscore'].sum()
-     # save the feature scores to a csv file
-     df.to_csv("Allstate Claims Severity/feat_importance.csv", index=False)
-
-     # visualize the feature importances
-     plt.figure()
-     df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
-     plt.title('XGBoost Feature Importance')
-     plt.xlabel('relative importance')
-     plt.show()
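As a side note, xgboost also ships a plotting helper that produces an equivalent chart directly from the trained booster, without the feature-map/DataFrame round-trip (reusing `bst` and `plt` from above):

- xgb.plot_importance(bst, max_num_features=20)   # bar chart of the top 20 features by fscore
- plt.show()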
