Coffee Dataset Analysis with Python

Tool: Jupyter Notebook

Startup screen:

Code walkthrough:


# -*- coding: utf-8 -*-
'''
Purpose: Python data analysis of the coffee dataset using the bootstrap method
Author: pegasus
Date: 2022/11/22
'''
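# Import the libraries needed for the analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline   # Put this at the top of the cell so plots render inside the notebook instead of in a new window
# Set a random seed so the sampling is reproducible
np.random.seed(42)
# Read the dataset from a CSV file
coffee_full = pd.read_csv('coffee_dataset.csv')
# Draw a sample; in the real world this may be the only data you can get.
coffee_red = coffee_full.sample(200)
# Print the sampled data
print('Observed sample:', coffee_red)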
# Compute the proportions of coffee drinkers and non-drinkers in the sample
drinks_coffee_scale = coffee_red.groupby(by='drinks_coffee').size() / coffee_red.shape[0]
print('Proportions of coffee drinkers and non-drinkers:', drinks_coffee_scale)
# What is the average height among coffee drinkers?
drinks_coffee_height = coffee_red[coffee_red['drinks_coffee'] == True]['height'].mean()
print('Average height of coffee drinkers:', drinks_coffee_height)
# What is the average height among non-coffee drinkers?
drinks_not_coffee_height = coffee_red[coffee_red['drinks_coffee'] == False]['height'].mean()
print('Average height of non-coffee drinkers:', drinks_not_coffee_height)
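# Simulate 200 "new" individuals from your original sample of 200 by sampling with replacement; this is a bootstrap sample.
coffee_new = coffee_red.sample(200, replace=True)
print('Bootstrap sample:', coffee_new)
# What are the proportions of coffee drinkers and non-drinkers in the bootstrap sample?
new_drinks_coffee_scale = coffee_new.groupby(by='drinks_coffee').size() / coffee_new.shape[0]
print('Proportions of coffee drinkers and non-drinkers in the bootstrap sample:', new_drinks_coffee_scale)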
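'''
Now simulate your bootstrap sample 10,000 times, and in each sample measure the
average height of non-coffee drinkers. Each bootstrap sample should be drawn from
the first sample of 200 data points. Plot the distribution and extract the values
needed for a 95% confidence interval. What do you notice about the sampling
distribution of the mean in this example?
'''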
boot_means = []
for n in range(10000):
    # Resample all 200 rows with replacement, then keep only the non-coffee drinkers
    bootsample = coffee_red.sample(200, replace=True)
    bootsample_mean = bootsample[bootsample['drinks_coffee'] == False]['height'].mean()
    boot_means.append(bootsample_mean)
boot_means = np.array(boot_means)
print('Bootstrap sample means:', boot_means)
print('Mean of the bootstrap sample means:', boot_means.mean())
# The 2.5th and 97.5th percentiles give the bounds of the 95% confidence interval
print('Percentile bounds:', np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5))
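# Plot a histogram of the bootstrap means and mark the interval bounds
plt.hist(boot_means, alpha=0.7);
plt.axvline(np.percentile(boot_means, 2.5), color='red', linewidth=2);
plt.axvline(np.percentile(boot_means, 97.5), color='red', linewidth=2);
# Does your interval capture the actual average height of non-coffee drinkers in the population? Compare the population mean with the two bounds of the 95% confidence interval, then answer the final quiz question below.
aa = coffee_full.query('drinks_coffee == False')['height'].mean()
print('Population mean height of non-coffee drinkers:', aa)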
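The loop above can also be packaged as a small reusable function. The following is only a sketch under stated assumptions: the helper name `bootstrap_mean_ci` and the synthetic heights are made up for illustration (the real `coffee_dataset.csv` is not used), and unlike the code above it resamples the non-drinker heights directly, so every resample has the same size.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, level=0.95, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; not part of the original post.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Draw all resamples at once: each row holds the indices of one
    # bootstrap sample of size n, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    # Tail percentage on each side, e.g. 2.5 for a 95% interval.
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(boot_means, [tail, 100.0 - tail])
    return lower, upper, boot_means

# Synthetic stand-in for the heights of non-coffee drinkers (made-up numbers).
data_rng = np.random.default_rng(0)
heights = data_rng.normal(loc=66.0, scale=3.0, size=80)

lower, upper, boot_means = bootstrap_mean_ci(heights)
print(f"sample mean: {heights.mean():.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

With the real data, you could pass `coffee_red[coffee_red['drinks_coffee'] == False]['height']` as `values`; the result should be close to, but not identical to, the interval from the loop above.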
 
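import statsmodels.stats.api as sm
# Heights of coffee drinkers (X1) and non-coffee drinkers (X2) in the sample
X1 = coffee_red[coffee_red['drinks_coffee'] == True]['height']
X2 = coffee_red[coffee_red['drinks_coffee'] == False]['height']
cm = sm.CompareMeans(sm.DescrStatsW(X1), sm.DescrStatsW(X2))
# Confidence interval for the difference in means, without assuming equal variances
print(cm.tconfint_diff(usevar='unequal'))
print(cm.summary())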
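As a cross-check on the t-based interval from statsmodels, the same bootstrap idea can be applied to the difference in mean height between the two groups. This is a rough sketch, not the method used above: `bootstrap_diff_ci` is a hypothetical helper, the two groups are synthetic stand-ins with made-up means, and each group is resampled independently.

```python
import numpy as np

def bootstrap_diff_ci(x1, x2, n_boot=10_000, level=0.95, seed=42):
    """Percentile bootstrap interval for mean(x1) - mean(x2).

    Hypothetical helper for illustration; not part of the original post.
    """
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group separately, with replacement, at its own size.
        s1 = rng.choice(x1, size=x1.size, replace=True)
        s2 = rng.choice(x2, size=x2.size, replace=True)
        diffs[i] = s1.mean() - s2.mean()
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(diffs, [tail, 100.0 - tail])
    return lower, upper

# Synthetic stand-ins for the two groups (made-up numbers, not the real data).
data_rng = np.random.default_rng(1)
drinkers = data_rng.normal(loc=68.0, scale=3.0, size=120)
non_drinkers = data_rng.normal(loc=66.0, scale=3.0, size=80)

observed_diff = drinkers.mean() - non_drinkers.mean()
lower, upper = bootstrap_diff_ci(drinkers, non_drinkers)
print(f"observed difference: {observed_diff:.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

If the bootstrap interval and the interval from `tconfint_diff` roughly agree, that is some reassurance that the t-based assumptions are not badly violated for this sample.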

Results:

Building the confusion matrix:


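import sklearn
import pandas as pd
 
print('sklearn version: ', sklearn.__version__)
print('pandas version: ', pd.__version__)
 
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1]
 
# Rows are true labels, columns are predicted labels (default label order: 0, 1)
cm = confusion_matrix(y_true, y_pred)
print('Confusion matrix:', cm)
 
# Put the positive class (1) first so that cm[0][0] is the true-positive count
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print('Confusion matrix 2:', cm)
 
tp = cm[0][0]  # true 1, predicted 1
fp = cm[1][0]  # true 0, predicted 1
tn = cm[1][1]  # true 0, predicted 0
fn = cm[0][1]  # true 1, predicted 0
 
print('TP, FP, TN, FN:', tp, fp, tn, fn)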
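Once the four counts are available, the usual classification metrics follow directly. The sketch below recomputes the counts by hand from the same `y_true` and `y_pred` lists (so it runs without scikit-learn) and derives precision, recall and accuracy; the variable names are chosen here for illustration.

```python
# Same labels and predictions as in the confusion-matrix example.
y_true = [0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1]

# Count each outcome by comparing the pairs directly (positive class = 1).
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
accuracy = (tp + tn) / len(pairs)

print("TP, FP, TN, FN:", tp, fp, tn, fn)   # TP, FP, TN, FN: 1 4 0 2
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.200 recall=0.333 accuracy=0.143
```

These counts should match the `tp`, `fp`, `tn`, `fn` values extracted from the matrix built with `labels=[1, 0]`.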

Results:

Dataset format and contents:

Dataset download link:

https://download.csdn.net/download/mzl_18353516147/87516856

Original article: https://blog.csdn.net/mzl_18353516147/article/details/127980100