Predicting FIFA World Cup Matches


Contents

1. Background

2. Choosing input features

3. Dataset and feature acquisition

4. Data preprocessing

5. Model training and selection

6. Prediction

7. New data after 2018

8. Personal summary


1. Background

With the quadrennial FIFA World Cup approaching, my company decided to run a contest pitting model predictions against human predictions. I am neither a football fan nor even a casual one, but out of love for machine learning I decided to build a model with my meager football knowledge and enter. Win or lose, the point is to enjoy it.

2. Choosing input features

Put simply, feature selection here means picking the features that most influence a football match. As someone who knows little about football, these are the candidates that came to mind:

1. Star players: Messi, Ronaldo, Mbappé, Neymar, and so on. But how do you quantify a player's strength? New stars keep appearing, so wouldn't this feature keep growing? And is the data even easy to obtain?

2. Home advantage: supposedly very important in basketball, but at a World Cup only the host nation plays at home; every other team is away.

3. Referees: some referees may favor one national team, so fouls that deserve a penalty go unpunished. But referees are far too random: they change every tournament, and you never know whom they favor.

4. Coaches: this feels like a feature, and a fairly stable one, but how would you measure it?

5. Team form: it clearly matters, but there is no obvious way to capture it.

6. Spectators: historical figures might be obtainable, but not for upcoming matches.

That leaves the team itself. A team with high overall strength does well even without star players, and overall strength implicitly captures the coach, the squad, and the stars. Weighing these factors against how hard each feature is to obtain, I settled on the team names as the only input feature.

3. Dataset and feature acquisition

The contest came up very suddenly, with matches only days away, so I took my chances and searched online for an existing dataset. Sure enough there were plenty, and I ended up downloading a free, pre-cleaned dataset from Kaggle.

Download the dataset: FIFA World Cup | Kaggle

It can also be downloaded from Baidu Netdisk:

Link: https://pan.baidu.com/s/1ky-73f9YTe2o1YI1TyMt9A  Extraction code: 1pt6

4. Data preprocessing

Although the Kaggle data has many columns, we only need three: the model's input is the two team names, and its output is the result (2: Home Team wins, 1: draw, 0: Away Team wins).

The preprocessing steps are roughly:

1. Load the data from the CSV file;
2. Derive a new winning_team column from each team's goals;
3. Drop national teams not in this World Cup;
4. Drop irrelevant columns;
5. Encode winning_team numerically;
6. Save a new CSV.

The code is as follows:

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
import joblib

root_path = "models"

# load data
results = pd.read_csv('datasets/WorldCupMatches.csv', encoding='gbk')

# establish who the winner is
winner = []
for i in range(len(results['Home Team Name'])):
    if results['Home Team Goals'][i] > results['Away Team Goals'][i]:
        winner.append(results['Home Team Name'][i])
    elif results['Home Team Goals'][i] < results['Away Team Goals'][i]:
        winner.append(results['Away Team Name'][i])
    else:
        winner.append('Draw')
results['winning_team'] = winner

# adding goal difference column
results['goal_difference'] = np.absolute(results['Home Team Goals'] - results['Away Team Goals'])

# narrowing to the 32 teams participating in the 2022 World Cup
# (note: ' Iran' keeps the leading space used in the dataset)
worldcup_teams = ['Qatar', 'Germany', 'Denmark', 'Brazil', 'France', 'Belgium', 'Serbia',
                  'Spain', 'Croatia', 'Switzerland', 'England', 'Netherlands', 'Argentina', ' Iran',
                  'Korea Republic', 'Saudi Arabia', 'Japan', 'Uruguay', 'Ecuador', 'Canada',
                  'Senegal', 'Poland', 'Portugal', 'Tunisia', 'Morocco', 'Cameroon', 'USA',
                  'Mexico', 'Wales', 'Australia', 'Costa Rica', 'Ghana']
df_teams_home = results[results['Home Team Name'].isin(worldcup_teams)]
df_teams_away = results[results['Away Team Name'].isin(worldcup_teams)]
df_teams = pd.concat((df_teams_home, df_teams_away))
df_teams = df_teams.drop_duplicates()  # drop_duplicates returns a copy; assign it back
print(df_teams.count())

# dropping columns that will not affect match outcomes
df_teams_new = df_teams[['Home Team Name', 'Away Team Name', 'winning_team']]
print(df_teams_new.head())

# the prediction label: winning_team is "2" if the home team won, "1" for a draw,
# and "0" if the away team won
df_teams_new = df_teams_new.reset_index(drop=True)
df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Home Team Name'], 'winning_team'] = 2
df_teams_new.loc[df_teams_new.winning_team == 'Draw', 'winning_team'] = 1
df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Away Team Name'], 'winning_team'] = 0
print(df_teams_new.count())
df_teams_new.to_csv('datasets/raw_train_data.csv', encoding='gbk', index=False)
```

After preprocessing, raw_train_data.csv is generated under the datasets folder.

The steps for saving the training set are roughly:

1. Load the preprocessed dataset;
2. One-hot encode the two team names with DictVectorizer;
3. Save the encoded dataset for training later.
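Step 2 turns each pair of team names into a one-hot row. A quick self-contained illustration of what DictVectorizer produces for two toy matches (the team pairs are arbitrary examples, not taken from the dataset):

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy matches; each dict becomes one row of the feature matrix.
matches = [
    {'Home Team Name': 'France', 'Away Team Name': 'Mexico'},
    {'Home Team Name': 'Brazil', 'Away Team Name': 'France'},
]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(matches)
print(vec.feature_names_)
# ['Away Team Name=France', 'Away Team Name=Mexico',
#  'Home Team Name=Brazil', 'Home Team Name=France']
print(X)
# [[0. 1. 0. 1.]
#  [1. 0. 1. 0.]]
```

Each distinct "column=team" pair becomes one binary feature, so a match is encoded as exactly two ones in an otherwise zero row.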

The code is as follows:

```python
df_teams_new = pd.read_csv('datasets/raw_train_data.csv', encoding='gbk')
feature = df_teams_new[['Home Team Name', 'Away Team Name']]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(feature.to_dict(orient='records'))
X = X.astype('int')
# vec.get_feature_names() was removed in scikit-learn 1.2; feature_names_ still works
print(vec.feature_names_)
y = df_teams_new[['winning_team']].astype('int')
print(X.shape)
print(y.shape)
joblib.dump(vec, root_path + "/vec.joblib")
np.savez('datasets/train_data', x=X, y=y)
```

The code above generates train_data.npz under the datasets folder.
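The saved .npz can be loaded back to sanity-check its contents. A minimal round-trip sketch using tiny stand-in arrays instead of the real X and y (and a temp-file path rather than datasets/train_data.npz):

```python
import os
import tempfile
import numpy as np

# Tiny stand-ins for the real arrays.
X = np.eye(4, dtype=int)            # pretend one-hot features
y = np.array([[2], [1], [0], [2]])  # pretend winning_team labels

path = os.path.join(tempfile.gettempdir(), 'train_data_check')
np.savez(path, x=X, y=y)            # np.savez appends the .npz suffix itself

data = np.load(path + '.npz')       # keys match the x=/y= keywords above
assert (data['x'] == X).all() and (data['y'] == y).all()
print(data['x'].shape, data['y'].shape)  # (4, 4) (4, 1)
```

Note that y comes back with shape (n, 1), which is why the training code later sees a column vector rather than a flat label array.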

The complete preprocessing code is as follows:

```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
import joblib

root_path = "models"

def reprocess_dataset():
    # load data
    results = pd.read_csv('datasets/WorldCupMatches.csv', encoding='gbk')

    # establish who the winner is
    winner = []
    for i in range(len(results['Home Team Name'])):
        if results['Home Team Goals'][i] > results['Away Team Goals'][i]:
            winner.append(results['Home Team Name'][i])
        elif results['Home Team Goals'][i] < results['Away Team Goals'][i]:
            winner.append(results['Away Team Name'][i])
        else:
            winner.append('Draw')
    results['winning_team'] = winner

    # adding goal difference column
    results['goal_difference'] = np.absolute(results['Home Team Goals'] - results['Away Team Goals'])

    # narrowing to the 32 teams participating in the 2022 World Cup
    # (note: ' Iran' keeps the leading space used in the dataset)
    worldcup_teams = ['Qatar', 'Germany', 'Denmark', 'Brazil', 'France', 'Belgium', 'Serbia',
                      'Spain', 'Croatia', 'Switzerland', 'England', 'Netherlands', 'Argentina', ' Iran',
                      'Korea Republic', 'Saudi Arabia', 'Japan', 'Uruguay', 'Ecuador', 'Canada',
                      'Senegal', 'Poland', 'Portugal', 'Tunisia', 'Morocco', 'Cameroon', 'USA',
                      'Mexico', 'Wales', 'Australia', 'Costa Rica', 'Ghana']
    df_teams_home = results[results['Home Team Name'].isin(worldcup_teams)]
    df_teams_away = results[results['Away Team Name'].isin(worldcup_teams)]
    df_teams = pd.concat((df_teams_home, df_teams_away))
    df_teams = df_teams.drop_duplicates()  # returns a copy; must be assigned back

    # dropping columns that will not affect match outcomes
    df_teams_new = df_teams[['Home Team Name', 'Away Team Name', 'winning_team']]
    print(df_teams_new.head())

    # the prediction label: "2" home win, "1" draw, "0" away win
    df_teams_new = df_teams_new.reset_index(drop=True)
    df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Home Team Name'], 'winning_team'] = 2
    df_teams_new.loc[df_teams_new.winning_team == 'Draw', 'winning_team'] = 1
    df_teams_new.loc[df_teams_new.winning_team == df_teams_new['Away Team Name'], 'winning_team'] = 0
    print(df_teams_new.count())
    df_teams_new.to_csv('datasets/raw_train_data.csv', encoding='gbk', index=False)

def save_dataset():
    df_teams_new = pd.read_csv('datasets/raw_train_data.csv', encoding='gbk')
    feature = df_teams_new[['Home Team Name', 'Away Team Name']]
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(feature.to_dict(orient='records')).astype('int')
    print(vec.feature_names_)
    y = df_teams_new[['winning_team']].astype('int')
    print(X.shape)
    print(y.shape)
    joblib.dump(vec, root_path + "/vec.joblib")
    np.savez('datasets/train_data', x=X, y=y)

if __name__ == '__main__':
    reprocess_dataset()
    save_dataset()
```

reprocess_dataset() preprocesses the raw data.

save_dataset() vectorizes the preprocessed data.

The dataset contains nothing after 2018, so I hand-collected the newer results myself. This new data can be merged with the preprocessed dataset above.
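The merge itself is a single pd.concat, provided the hand-collected file uses the same three columns. A sketch with toy in-memory stand-ins for the two files (real code would load both with pd.read_csv(..., encoding='gbk'); the rows below are illustrative, not real results):

```python
import pandas as pd

# Toy stand-ins for raw_train_data.csv and the hand-collected post-2018 file.
old = pd.DataFrame({'Home Team Name': ['France'],
                    'Away Team Name': ['Mexico'],
                    'winning_team': [2]})
new = pd.DataFrame({'Home Team Name': ['Qatar'],
                    'Away Team Name': ['Ecuador'],
                    'winning_team': [0]})

# Both frames must share the same three columns so the rows line up.
merged = pd.concat([old, new], ignore_index=True)
print(len(merged))  # 2
```

ignore_index=True renumbers the rows so the combined frame has a clean index before it is written back out.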

The 2018 and 2022 data can be downloaded from Baidu Netdisk:
Link: https://pan.baidu.com/s/1fe_z6kRXB8T69wx1HBxO8g  Extraction code: o55d

5. Model training and selection

Several traditional machine-learning models were trained and compared:

Model                                 Training Accuracy   Test Accuracy
Logistic Regression                   67.40%              61.60%
SVM (support vector machine)          67.30%              62.70%
Naive Bayes                           65.50%              63.80%
Random Forest                         90.80%              65.50%
XGBoost (Extreme Gradient Boosting)   75.30%              62.00%

Random Forest has the highest test accuracy (although the gap to its training accuracy suggests some overfitting), so that is the model used for prediction.

The complete training code is as follows:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier
import joblib

root_path = "models"

def get_dataset():
    return np.load('datasets/train_data.npz')

def train_by_LogisticRegression(train_data):
    X = train_data['x']
    y = train_data['y']
    # Separate train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    joblib.dump(logreg, root_path + '/LogisticRegression_model.joblib')
    print("LogisticRegression Training set accuracy: ", '%.3f' % logreg.score(X_train, y_train))
    print("LogisticRegression Test set accuracy: ", '%.3f' % logreg.score(X_test, y_test))

def train_by_svm(train_data):
    X = train_data['x']
    y = train_data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    model = svm.SVC(kernel='linear', verbose=True, probability=True)
    model.fit(X_train, y_train)
    joblib.dump(model, root_path + '/svm_model.joblib')
    print("SVM Training set accuracy: ", '%.3f' % model.score(X_train, y_train))
    print("SVM Test set accuracy: ", '%.3f' % model.score(X_test, y_test))

def train_by_naive_bayes(train_data):
    X = train_data['x']
    y = train_data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    joblib.dump(model, root_path + '/naive_bayes_model.joblib')
    print("naive_bayes Training set accuracy: ", '%.3f' % model.score(X_train, y_train))
    print("naive_bayes Test set accuracy: ", '%.3f' % model.score(X_test, y_test))

def train_by_random_forest(train_data):
    X = train_data['x']
    y = train_data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    model = RandomForestClassifier(criterion='gini', max_features='sqrt')
    model.fit(X_train, y_train)
    joblib.dump(model, root_path + '/random_forest_model.joblib')
    print("random forest Training set accuracy: ", '%.3f' % model.score(X_train, y_train))
    print("random forest Test set accuracy: ", '%.3f' % model.score(X_test, y_test))

def train_by_xgb(train_data):
    X = train_data['x']
    y = train_data['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    model = XGBClassifier(use_label_encoder=False)
    model.fit(X_train, y_train)
    joblib.dump(model, root_path + '/xgb_model.joblib')
    print("xgb Training set accuracy: ", '%.3f' % model.score(X_train, y_train))
    print("xgb Test set accuracy: ", '%.3f' % model.score(X_test, y_test))
    y_pred = model.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    # show_confusion_matrix(y_test, y_pred)
    print(report)

def show_confusion_matrix(y_true, y_pred, pic_name="confusion_matrix"):
    confusion = confusion_matrix(y_true=y_true, y_pred=y_pred)
    print(confusion)
    sns.heatmap(confusion, annot=True, cmap='Blues',
                xticklabels=['0', '1', '2'], yticklabels=['0', '1', '2'], fmt='.20g')
    plt.xlabel('Predicted class')
    plt.ylabel('Actual class')
    plt.title(pic_name)
    # plt.savefig('pic/' + pic_name)
    plt.show()

if __name__ == '__main__':
    train_data = get_dataset()
    train_by_LogisticRegression(train_data)
    train_by_svm(train_data)
    train_by_naive_bayes(train_data)
    train_by_random_forest(train_data)
    train_by_xgb(train_data)
```

Running the code above saves five model files under the models folder: LogisticRegression_model.joblib, svm_model.joblib, naive_bayes_model.joblib, random_forest_model.joblib, and xgb_model.joblib.

6. Prediction

The prediction logic is roughly:

1. Input the two team names;
2. Validate the team names;
3. Load the model;
4. Run the prediction and print the probability of each outcome.

Running the prediction code below gives Ecuador beating Qatar and England beating Iran:

```
[2]
[[0.05       0.22033333 0.72966667]]
Probability of  Ecuador  winning: 0.730
Probability of Draw: 0.220
Probability of  Qatar  winning: 0.050
[2]
[[0.02342857 0.21770455 0.75886688]]
Probability of  England  winning: 0.759
Probability of Draw: 0.218
Probability of   Iran  winning: 0.023
```

The complete prediction code is as follows:

```python
import joblib

# note: ' Iran' keeps the leading space used in the dataset
worldcup_teams = ['Qatar', 'Germany', 'Denmark', 'Brazil', 'France', 'Belgium', 'Serbia',
                  'Spain', 'Croatia', 'Switzerland', 'England', 'Netherlands', 'Argentina', ' Iran',
                  'Korea Republic', 'Saudi Arabia', 'Japan', 'Uruguay', 'Ecuador', 'Canada',
                  'Senegal', 'Poland', 'Portugal', 'Tunisia', 'Morocco', 'Cameroon', 'USA',
                  'Mexico', 'Wales', 'Australia', 'Costa Rica', 'Ghana']

root_path = "models"

def verify_team_name(team_name):
    return team_name in worldcup_teams

def predict(model_dir=root_path + '/LogisticRegression_model.joblib',
            team_a='France', team_b='Mexico'):
    if not verify_team_name(team_a):
        print(team_a, ' is not correct')
        return
    if not verify_team_name(team_b):
        print(team_b, ' is not correct')
        return
    model = joblib.load(model_dir)
    input_x = [{'Home Team Name': team_a, 'Away Team Name': team_b}]
    vec = joblib.load(root_path + "/vec.joblib")
    input_x = vec.transform(input_x)
    print(model.predict(input_x))
    proba = model.predict_proba(input_x)
    print(proba)
    print('Probability of ', team_a, ' winning:', '%.3f' % proba[0][2])
    print('Probability of Draw:', '%.3f' % proba[0][1])
    print('Probability of ', team_b, ' winning:', '%.3f' % proba[0][0])

if __name__ == '__main__':
    predict('models/random_forest_model.joblib', 'Ecuador', 'Qatar')
    predict('models/random_forest_model.joblib', 'England', ' Iran')
```
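One caveat worth checking: the hard-coded probability indices ([0][2] for a home win and so on) are only valid because scikit-learn orders predict_proba columns by the model's sorted class labels. A small self-contained sketch on toy three-class data (not the World Cup model) confirming the ordering:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-class data with the same labels as the match outcome encoding.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [1, 2]])
y = np.array([0, 2, 1, 1, 2, 0])
clf = LogisticRegression().fit(X, y)

print(clf.classes_)  # [0 1 2]: column 0 = away win, column 1 = draw, column 2 = home win
proba = clf.predict_proba(X[:1])
assert proba.shape == (1, 3)
assert abs(proba.sum() - 1.0) < 1e-9  # each row is a probability distribution
```

If the training labels ever changed (say, to strings), the column order would change with classes_, so indexing via clf.classes_ is safer than hard-coding positions.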

7. New data after 2018

The 2018 and 2022 data can be downloaded from Baidu Netdisk:
Link: https://pan.baidu.com/s/1fe_z6kRXB8T69wx1HBxO8g  Extraction code: o55d

8. Personal summary

The features are pitifully few; adding player information and team-form features would help. The data is also rather small: it would be better to fold in every edition of the European Championship, Asian Cup, Africa Cup of Nations, and Copa América. And the data is old (everything since 1930), but there was nothing else available.

Could the score itself be predicted as a classification problem? Say each team scores at most 4 goals (a 7-goal haul like Spain's is rare enough to ignore), which gives 5 x 5 = 25 classes. Whether that works well remains to be tested.
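The 25-class scoreline idea can be sketched as a simple label encoding. A hypothetical helper pair, not part of the original code, just to make the mapping concrete:

```python
def score_to_class(home_goals: int, away_goals: int, cap: int = 4) -> int:
    """Map a scoreline to one of (cap + 1) ** 2 classes, clipping rare blowouts to the cap."""
    return min(home_goals, cap) * (cap + 1) + min(away_goals, cap)

def class_to_score(label: int, cap: int = 4):
    """Inverse mapping: class index back to a (home, away) scoreline."""
    return divmod(label, cap + 1)

print(score_to_class(2, 1))  # 11
print(class_to_score(11))    # (2, 1)
print(score_to_class(7, 0))  # 20: the 7-goal score is clipped to 4
```

With labels built this way, the same DictVectorizer features could feed any of the classifiers above; the open question is whether 25 classes is too fine-grained for so little data.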

Original article: https://blog.csdn.net/keeppractice/article/details/128022027