In general, ensemble algorithms are a good choice for inspecting feature importance.
To compute the permutation importance of a single feature:
1. Train the current model.
2. To measure feature A's effect on the predictions, shuffle the values of column A and compare the model's error before and after. If the error barely changes, feature A is unimportant; if the error increases noticeably, it is important.
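The two steps above can be sketched from scratch. The snippet below is a minimal illustration on synthetic data (the dataset from `make_classification` and the feature count are assumptions for the demo, not the FIFA data used later):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 3 informative features, 2 pure-noise features.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Step 1: train the current model.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = model.score(X_val, y_val)  # validation accuracy before shuffling

# Step 2: shuffle one column at a time and measure the accuracy drop.
rng = np.random.default_rng(1)
importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, j])  # break the link between feature j and y
    importances.append(baseline - model.score(X_perm, y_val))

print(importances)  # large drop = important feature, near zero = unimportant
```

Informative columns should show a clearly positive drop, while the noise columns stay near zero (sometimes slightly negative due to chance).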
Toolkit: eli5 https://eli5.readthedocs.io/en/latest/tutorials/xgboost-titanic.html#explaining-weights
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv('FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # convert labels to boolean
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]  # keep integer columns only
X = data[feature_names]
X.head()
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
import eli5  # pip install eli5
from eli5.sklearn import PermutationImportance

# Compute permutation importance on the held-out validation set
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
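If eli5 is not available, scikit-learn ships an equivalent utility, `sklearn.inspection.permutation_importance`. The sketch below shows it on synthetic data, since the FIFA CSV is not bundled here (the dataset shape and `n_repeats` value are assumptions for the demo):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the FIFA CSV.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

# Shuffle each column n_repeats times and average the score drops.
result = permutation_importance(model, val_X, val_y, n_repeats=10,
                                random_state=1)
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```

`importances_mean` plays the same role as the weight column in `eli5.show_weights`, and `importances_std` corresponds to the reported spread across repeats.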
