Reposted content:
Recommender Systems Meet Deep Learning (10): GBDT+LR Fusion in Practice - 知乎 (zhihu.com)
GBDT+LR Algorithm Analysis and Python Implementation - Bo_hemian - 博客园 (cnblogs.com)
GBDT differs substantially from traditional Boosting. Each new tree is built to reduce the residual left by the previous ensemble, and to shrink that residual the new model is built along the gradient direction in which the residual decreases. In Gradient Boosting, therefore, each new model is fitted so that the ensemble's residual moves in the direction of gradient descent, which is quite different from traditional Boosting's re-weighting of correctly and incorrectly classified samples.
The key idea of the Gradient Boosting algorithm is to use the negative gradient of the loss function, evaluated at the current model, as an approximation of the residual, and to fit a CART regression tree to it.
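Concretely, the pseudo-residual fitted at iteration $t$ is the negative gradient of the loss evaluated at the current model (standard notation, added here for reference):

$$r_{t,i} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f = f_{t-1}}$$

A CART regression tree $h_t$ is fitted to the pairs $(x_i, r_{t,i})$ and added to the ensemble as $f_t = f_{t-1} + \eta\, h_t$, with learning rate $\eta$.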
GBDT sums the outputs of all its trees, and such a sum cannot be produced by classification outputs, so the trees in GBDT are all CART regression trees, not classification trees (GBDT can be adapted for classification, but that does not make its trees classification trees); the additive form is sketched after the list below. GBDT's performance is a further step up from Random Forest, and its advantages are clear:
1. It can flexibly handle various types of data;
2. It achieves high predictive accuracy with relatively little tuning time.
Because it is Boosting, the base learners depend on each other serially, which makes training hard to parallelize.
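To see why the trees must be regression trees, note the additive form of the final model (standard notation, not from the original post): the ensemble sums real-valued tree outputs, and for binary classification a sigmoid is applied on top:

$$F_T(x) = \sum_{t=1}^{T} \eta\, h_t(x), \qquad p(y = 1 \mid x) = \frac{1}{1 + e^{-F_T(x)}}$$

The sum only makes sense if each $h_t(x)$ is a real number, which is exactly what a CART regression tree's leaf values provide.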

```python
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np


# Load the whitespace-separated bank-loan dataset, indexed by nameid
data = pd.read_csv('loan_data 银行借贷决策树.txt', sep=r'\s+', encoding='utf-8', index_col='nameid')
data.head()

data.columns

# Features are every column except the label 'approve'
x = data.drop(['approve'], axis=1).values
y = data['approve'].values
print(x.shape, y.shape)
```
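The loan-data file above comes from the original post and is not distributed with it. To make the rest of the pipeline runnable without that file, one can substitute synthetic data (a hypothetical stand-in; the shapes roughly mirror the original, and none of this is in the source code):

```python
# Hypothetical fallback: synthetic binary-classification data of about the
# same size as the loan dataset, so the remaining steps can run end to end.
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, random_state=10)
```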
```python
# Use the first 900 rows; the tail (x2, y2) is held out and unused in this excerpt
x1 = x[:900]
y1 = y[:900]
x2 = x[900:]
y2 = y[900:]

X_train, X_test, y_train, y_test = train_test_split(x1, y1, test_size=0.2)

# Note: min_samples_split=900 exceeds the ~720 training samples, so the trees
# may never split; the setting looks carried over from a larger dataset
gbm1 = GradientBoostingClassifier(n_estimators=50, random_state=10, subsample=0.6,
                                  max_depth=7, min_samples_split=900)
gbm1.fit(X_train, y_train)
```
```python
# Extract GBDT features from real data:
# model.apply(X_train) returns, for each sample of X_train, the index of the
# leaf it lands in within every tree of the fitted model
# (shape: n_samples x n_estimators x n_classes; n_classes is 1 for binary)
train_new_feature = gbm1.apply(X_train)
train_new_feature = train_new_feature.reshape(-1, 50)

# One-hot encode the leaf indices into GBDT features
# (pandas get_dummies() can also produce a one-hot encoding)
enc = OneHotEncoder()
enc.fit(train_new_feature)
train_new_feature2 = enc.transform(train_new_feature).toarray()
```
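Note that `train_new_feature2` is never used again in this excerpt; the post switches to LightGBM next. For completeness, here is a minimal sketch of how these sklearn-GBDT leaf features would feed a logistic regression (my own illustration, not part of the original code; `handle_unknown='ignore'` guards against test-set leaves unseen at fit time):

```python
from sklearn.linear_model import LogisticRegression

# Refit the encoder so unseen leaf indices in the test set are ignored
# rather than raising an error.
enc2 = OneHotEncoder(handle_unknown='ignore')
train_feats = enc2.fit_transform(train_new_feature)
test_feats = enc2.transform(gbm1.apply(X_test).reshape(-1, 50))

lr_sklearn = LogisticRegression(max_iter=1000)
lr_sklearn.fit(train_feats, y_train)
print('sklearn GBDT+LR test accuracy:', lr_sklearn.score(test_feats, y_test))
```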
```python
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': 64,
    'num_trees': 100,       # alias of num_iterations in LightGBM
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
```
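With these settings, the dimensionality of the transformed feature space built below follows directly (simple arithmetic on the two parameters):

$$d = \text{num\_trees} \times \text{num\_leaves} = 100 \times 64 = 6400$$

Each transformed sample is a 6400-dimensional binary vector containing exactly 100 ones, one per tree.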
```python
# number of leaves, will be used in feature transformation
num_leaf = 64

# train (the original passed lgb_train as valid_sets; evaluating on the
# held-out lgb_eval is more informative)
gbm = lgb.train(params=params,
                train_set=lgb_train,
                valid_sets=[lgb_eval])

# save model to file
gbm.save_model('model.txt')
```
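Because the booster is saved to model.txt, it can later be reloaded without retraining; this is the standard LightGBM API (shown here as a usage aside, not part of the original flow):

```python
# Reload the saved booster; predict (including pred_leaf=True) works the same
gbm_loaded = lgb.Booster(model_file='model.txt')
```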
```python
# With pred_leaf=True, predict returns, for each sample, the leaf node it
# falls into in each of the 100 trees
y_pred = gbm.predict(X_train, pred_leaf=True)
# Without pred_leaf, predict returns the predicted probability of the positive class
y_pred_prob = gbm.predict(X_train)

'''LightGBM was trained with num_trees=100, so each row of y_pred holds the
leaf index the sample falls into in each of the 100 trees.'''
y_pred.shape
# (720, 100)

len(y_pred[1])  # 100: length of the second row, i.e. the number of trees
len(y_pred)     # 720: the number of samples
len(y_pred[0])  # 100: length of the first row, again the number of trees

# Threshold the probabilities into hard 0/1 predictions
result = []
threshold = 0.5
for pred in y_pred_prob:
    result.append(1 if pred > threshold else 0)
print('result:', result)
```
```python
print('Writing transformed training data')
transformed_training_matrix = np.zeros([len(y_pred), len(y_pred[1]) * num_leaf],
                                       dtype=np.int64)  # N * num_trees * num_leaf
for i in range(0, len(y_pred)):
    # temp holds the flattened index of the leaf hit in each tree: tree t's
    # leaves start at offset t * num_leaf (0, 64, 128, ..., 6336 for the 100
    # trees), to which the within-tree leaf index is added
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    # build the one-hot training matrix
    transformed_training_matrix[i][temp] += 1

# Transform the test set the same way
y_pred = gbm.predict(X_test, pred_leaf=True)
print('Writing transformed testing data')
transformed_testing_matrix = np.zeros([len(y_pred), len(y_pred[1]) * num_leaf], dtype=np.int64)

for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    # build the one-hot testing matrix
    transformed_testing_matrix[i][temp] += 1
```
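The per-sample loops above can be collapsed into one vectorized assignment with NumPy advanced indexing (an equivalent rewrite of my own, not from the original post):

```python
# Vectorized equivalent: compute all flattened leaf indices at once and
# scatter ones into the matrix in a single assignment.
leaves = gbm.predict(X_train, pred_leaf=True)              # shape (N, 100)
flat_idx = np.arange(leaves.shape[1]) * num_leaf + leaves  # shape (N, 100)
transformed = np.zeros((leaves.shape[0], leaves.shape[1] * num_leaf), dtype=np.int64)
transformed[np.arange(leaves.shape[0])[:, None], flat_idx] = 1
```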
```python
# Predict with LR on the transformed features
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression(penalty='l2', C=0.05)
lm.fit(transformed_training_matrix, y_train)
# predict_proba returns per-class probabilities, columns [P(y=0), P(y=1)]
y_pred_test = lm.predict_proba(transformed_testing_matrix)
y_pred_test

# NE numerator: the (1±y)/2 weights assume labels in {-1, +1}, so map the 0/1
# labels via y -> 2*y - 1; then normalize by the background-CTR entropy
yy = 2 * y_test - 1
p = y_pred_test[:, 1]
ce = (-1) / len(p) * np.sum((1 + yy) / 2 * np.log(p) + (1 - yy) / 2 * np.log(1 - p))
ctr = np.mean(y_test)
NE = ce / (-(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr)))
print("Normalized Cross Entropy " + str(NE))
```
In the Facebook paper, the model is evaluated with NE (Normalized Cross-Entropy), computed as follows, where $y_i \in \{-1, +1\}$ are the labels, $p_i$ the predicted probabilities, and $p$ the average empirical CTR of the training set:

$$NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\big(p\,\log p + (1-p)\log(1-p)\big)}$$

The numerator is the model's average log loss and the denominator is the entropy of the background CTR; the lower the NE, the better the model.