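XGBoost has remained a staple of machine learning: for both classification and regression it makes a solid baseline and is often good enough to be the final model. Its broad applicability comes with a large number of tunable parameters, and on the same task different settings can give different, sometimes drastically different, results. Finding parameters that are usable and effective for the task at hand is therefore an essential step. The previous article in this series, "XGB series: XGB parameter guide" (https://blog.csdn.net/wwlsm_zql/article/details/126192959), introduced all of XGBoost's parameters, which fall into three types that must be set before running XGBoost: general parameters, booster parameters and task parameters. With so many parameters, enumerating combinations by trial and error is a huge amount of work, so this article shows how to use hyperopt to search automatically for the best parameters for your own task.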
!pip install xgboost scikit-learn hyperopt
- # Import the basic packages
- import pandas as pd
- import numpy as np
- import xgboost as xgb
- from sklearn.metrics import accuracy_score
- from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
- from sklearn.model_selection import train_test_split
- df = pd.read_csv("drive/MyDrive/data_daily/Wholesalecustomersdata.csv")
-
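- x = df.drop('Channel', axis=1)
- # Convert the two-class target to 0/1: Channel 2 becomes 0, Channel 1 stays 1.
- # replace() returns a new Series, which avoids assigning into a slice of df.
- y = df['Channel'].replace({2: 0})
-
- # Split the data into training and test sets
- X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)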
Commonly used hyperopt expressions for defining a search space include:
hp.choice(label, options): returns one of the options, which should be a list or tuple.
hp.randint(label, upper): returns a random integer in the range [0, upper).
hp.uniform(label, low, high): returns a value drawn uniformly between low and high.
hp.quniform(label, low, high, q): returns round(uniform(low, high) / q) * q, i.e. a value quantized to a multiple of q. The result is still a float, so cast it with int() where an integer is required.
hp.normal(label, mu, sigma): returns a real value that is normally distributed with mean mu and standard deviation sigma.
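To make the sampling rules above concrete, here is a plain-Python sketch of the formulas that uses only the standard library. The `sample_*` helpers are illustrative names, not hyperopt functions, and hyperopt's own sampler may differ in detail. The point to notice is that a quantized draw is still a float, which is why integer-valued model parameters need a cast.

```python
import random

def sample_uniform(low, high, rng):
    """Draw a real value uniformly between low and high."""
    return rng.uniform(low, high)

def sample_quniform(low, high, q, rng):
    """Draw round(uniform(low, high) / q) * q, kept as a float."""
    return float(round(rng.uniform(low, high) / q) * q)

def sample_randint(upper, rng):
    """Draw an integer from the half-open range [0, upper)."""
    return rng.randrange(upper)

rng = random.Random(0)
depth = sample_quniform(3, 18, 1, rng)
print(type(depth).__name__, depth)  # a float holding a whole number
print(int(depth))                   # cast before passing it as max_depth
```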
- space = {'max_depth': hp.quniform('max_depth', 3, 18, 1),
-          'gamma': hp.uniform('gamma', 1, 9),
-          'reg_alpha': hp.quniform('reg_alpha', 40, 180, 1),
-          'reg_lambda': hp.uniform('reg_lambda', 0, 1),
-          'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
-          'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
-          'n_estimators': 180,
-          'seed': 0
-          }
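- def objective(space):
-     # quniform samples are floats, so cast the integer-valued parameters
-     clf = xgb.XGBClassifier(
-         n_estimators=space['n_estimators'],
-         max_depth=int(space['max_depth']),
-         gamma=space['gamma'],
-         reg_alpha=int(space['reg_alpha']),
-         reg_lambda=space['reg_lambda'],
-         min_child_weight=int(space['min_child_weight']),
-         # keep colsample_bytree as a float; int() would truncate values below 1 to 0
-         colsample_bytree=space['colsample_bytree'],
-         random_state=space['seed'],
-         # newer xgboost versions expect these in the constructor rather than in fit()
-         eval_metric="auc",
-         early_stopping_rounds=10)
-
-     evaluation = [(X_train, y_train), (X_test, y_test)]
-     clf.fit(X_train, y_train, eval_set=evaluation, verbose=False)
-
-     # predict() already returns class labels, so no thresholding is needed
-     pred = clf.predict(X_test)
-     accuracy = accuracy_score(y_test, pred)
-     print("SCORE:", accuracy)
-     # fmin minimizes the loss, so return the negative accuracy
-     return {'loss': -accuracy, 'status': STATUS_OK}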
- trials = Trials()
-
- best_hyperparams = fmin(fn = objective,
- space = space,
- algo = tpe.suggest,
- max_evals = 100,
- trials = trials)
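Here best_hyperparams holds the parameter values that gave the lowest loss among all evaluated trials. Because the objective returns the negative accuracy, the lowest loss corresponds to the highest accuracy on the test set.
trials is an object that stores the relevant information for every evaluation, such as the sampled hyperparameters and the resulting loss.
fmin is the optimization function that minimizes the loss. The call above passes it fn (the objective), space (the search space), algo (the search algorithm), max_evals (the number of evaluations) and trials.
The algorithm used is tpe.suggest, hyperopt's Tree-structured Parzen Estimator.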
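The roles of fn, space, max_evals and trials can be seen without training any model. The sketch below is a plain random search over a toy objective, written with the standard library only; it follows the same calling convention but it is not hyperopt's implementation and it does not use TPE. `random_search`, `toy_objective` and the `(low, high)` space format are hypothetical names and conventions chosen for this illustration.

```python
import random

STATUS_OK = "ok"  # stand-in for hyperopt's STATUS_OK in this sketch

def random_search(fn, space, max_evals, trials, seed=0):
    """Minimize fn by random search; space maps names to (low, high) ranges."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        # Sample one candidate point from the search space
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        result = fn(params)
        # Record every evaluation, as a Trials object would
        trials.append({"params": params, "result": result})
        if result["status"] == STATUS_OK and result["loss"] < best_loss:
            best_params, best_loss = params, result["loss"]
    return best_params

def toy_objective(params):
    # The loss is smallest at x = 3
    return {"loss": (params["x"] - 3.0) ** 2, "status": STATUS_OK}

history = []
best = random_search(toy_objective, {"x": (-10.0, 10.0)}, max_evals=200, trials=history)
print(best, len(history))
```

TPE differs from this sketch in how it proposes the next candidate: it uses the losses recorded so far to favour promising regions instead of sampling blindly.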
- print("The best hyperparameters are : ","\n")
- print(best_hyperparams)
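The dictionary printed above contains only the values that were sampled, so constants such as n_estimators are absent, and values drawn with hp.quniform come back as floats (for example 7.0 for max_depth). A minimal sketch of turning it into parameters for a final model follows; `to_xgb_params` is a hypothetical helper, and the numbers in `example_best` are made up for illustration, not results from a real run.

```python
def to_xgb_params(best, fixed=None):
    """Cast integer-valued entries and add constants that were not searched."""
    # These are the parameters the objective casts with int()
    int_keys = {"max_depth", "min_child_weight", "reg_alpha"}
    params = {k: int(v) if k in int_keys else float(v) for k, v in best.items()}
    params.update(fixed or {})
    return params

# Illustrative values only
example_best = {"max_depth": 7.0, "gamma": 2.5, "reg_alpha": 41.0,
                "reg_lambda": 0.3, "colsample_bytree": 0.8,
                "min_child_weight": 4.0}
final_params = to_xgb_params(example_best, fixed={"n_estimators": 180})
print(final_params)
```

The resulting dictionary could then be unpacked into the classifier, for example `xgb.XGBClassifier(**final_params)`.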