• Python programming notes: a close look at how the X, y and X_display, y_display outputs differ in the source of shap.datasets.adult()


    Contents

    A close look at X, y and X_display, y_display in the shap.datasets.adult() source

    Reading the source

    Understanding the source code

    Comparing data and raw_data

    X.shape 

    X_display.shape 


    A close look at X, y and X_display, y_display in the shap.datasets.adult() source

    import shap

    X, y = shap.datasets.adult()
    X_display, y_display = shap.datasets.adult(display=True)

    Reading the source

    import numpy as np
    import pandas as pd

    # `cache` and `github_data_url` are assumed to come from the surrounding
    # shap.datasets module; they are not defined in this excerpt.

    def adult(display=False):
        """ Return the Adult census data in a nice package. """
        dtypes = [
            ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
            ("Education", "category"), ("Education-Num", "float32"), ("Marital Status", "category"),
            ("Occupation", "category"), ("Relationship", "category"), ("Race", "category"),
            ("Sex", "category"), ("Capital Gain", "float32"), ("Capital Loss", "float32"),
            ("Hours per week", "float32"), ("Country", "category"), ("Target", "category")
        ]
        # raw_data keeps the values exactly as read from the CSV file.
        raw_data = pd.read_csv(
            cache(github_data_url + "adult.data"),
            names=[d[0] for d in dtypes],
            na_values="?",
            dtype=dict(dtypes)
        )
        data = raw_data.drop(["Education"], axis=1)  # redundant with Education-Num
        filt_dtypes = list(filter(lambda x: not (x[0] in ["Target", "Education"]), dtypes))
        # The label becomes a boolean; note the leading space in " >50K".
        data["Target"] = data["Target"] == " >50K"
        rcode = {
            "Not-in-family": 0,
            "Unmarried": 1,
            "Other-relative": 2,
            "Own-child": 3,
            "Husband": 4,
            "Wife": 5
        }
        # Replace every categorical column in data with integer codes.
        for k, dtype in filt_dtypes:
            if dtype == "category":
                if k == "Relationship":
                    data[k] = np.array([rcode[v.strip()] for v in data[k]])
                else:
                    data[k] = data[k].cat.codes
        if display:
            return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
        return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values
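
    The steps of the function above can be tried offline on a few rows. The sketch below is not the shap implementation: toy_adult, DTYPES, TOY_CSV and RCODE are hypothetical names, the column list is shortened, and the row values are illustrative only. It reads an in-memory CSV instead of downloading adult.data, then applies the same drop and encode steps.

```python
import io

import numpy as np
import pandas as pd

# Shortened column list: same names and dtypes as in adult(), fewer columns.
DTYPES = [
    ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
    ("Education", "category"), ("Education-Num", "float32"),
    ("Relationship", "category"), ("Target", "category"),
]

# Toy rows (illustrative values only). Fields are separated by ", ",
# so every string value keeps a leading space.
TOY_CSV = (
    "39, State-gov, 1000, Bachelors, 13, Not-in-family, <=50K\n"
    "50, Self-emp-not-inc, 2000, Bachelors, 13, Husband, >50K\n"
    "28, Private, 3000, Bachelors, 13, Wife, <=50K\n"
    "38, Private, 4000, HS-grad, 9, Not-in-family, >50K\n"
)

RCODE = {"Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
         "Own-child": 3, "Husband": 4, "Wife": 5}


def toy_adult(display=False):
    """Mirror the steps of adult() on the in-memory toy CSV."""
    raw_data = pd.read_csv(
        io.StringIO(TOY_CSV),
        names=[name for name, _ in DTYPES],
        na_values="?",
        dtype=dict(DTYPES),
    )
    data = raw_data.drop(["Education"], axis=1)
    data["Target"] = data["Target"] == " >50K"  # leading space matters
    for name, dtype in DTYPES:
        if name in ("Target", "Education") or dtype != "category":
            continue
        if name == "Relationship":
            # Hand-written mapping, applied to the stripped label.
            data[name] = np.array([RCODE[v.strip()] for v in data[name]])
        else:
            # pandas category codes for every other categorical column.
            data[name] = data[name].cat.codes
    if display:
        return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
    return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values


X, y = toy_adult()
X_display, y_display = toy_adult(display=True)
print(X.shape, X_display.shape)    # both frames lose the same three columns
print(X.dtypes.to_dict())          # numeric dtypes only
print(X_display.dtypes.to_dict())  # categorical columns still hold strings
print((y == y_display).all())      # the labels do not depend on display
```

    Both calls return frames of the same shape; only the contents of the categorical columns differ.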

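    Understanding the source code

    The function reads adult.data into raw_data with 15 named columns and treats "?" as a missing value. data starts as raw_data without the Education column, which is redundant with Education-Num. The Target column in data is turned into a boolean that is True where the label is " >50K". Every remaining categorical column in data is then replaced by integers: Relationship through the hand-written rcode mapping, all others through their pandas category codes.

    With display=True the function returns raw_data without Education, Target and fnlwgt. Otherwise it returns data without Target and fnlwgt. In both cases the second return value is data["Target"].values.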

    Comparing data and raw_data

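    Conclusion: data is a new DataFrame derived from raw_data, the CSV file as read in. By the time it is returned, three columns (Education, Target, fnlwgt) have been dropped and its categorical columns have been converted to integer codes. raw_data is the original data from the CSV; the same three columns are dropped and everything else is returned unchanged. y and y_display are identical, because both are data["Target"].values.
    Significance: X is entirely numeric, while X_display keeps the original labels for the same rows in the same order, so the two frames can be used side by side: one for computation, the other for readable output.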
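
    One detail of the encoding is easy to miss: Relationship does not get the alphabetical codes that .cat.codes would assign, because it goes through the hand-written dictionary. The sketch below compares the two encodings and shows one way to map codes back to labels. The names RCODE, labels and workclass are illustrative, and the values are a small hand-picked sample rather than the full dataset.

```python
import pandas as pd

RCODE = {"Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
         "Own-child": 3, "Husband": 4, "Wife": 5}
labels = ["Not-in-family", "Husband", "Wife"]

# Codes that .cat.codes would assign: categories are sorted alphabetically.
alphabetical = pd.Series(labels, dtype="category").cat.codes.tolist()

# Codes from the hand-written mapping used for Relationship.
manual = [RCODE[label] for label in labels]

print(alphabetical)  # [1, 0, 2]
print(manual)        # [0, 4, 5]

# Decoding Relationship: invert the dictionary.
inverse_rcode = {code: label for label, code in RCODE.items()}
decoded_manual = [inverse_rcode[code] for code in manual]

# Decoding a .cat.codes column: reuse the categories stored on the original.
workclass = pd.Series([" State-gov", " Private", " Private"], dtype="category")
decoded_workclass = list(
    pd.Categorical.from_codes(workclass.cat.codes, categories=workclass.cat.categories)
)
print(decoded_manual)
print(decoded_workclass)
```

    In practice the decoding step is rarely needed, since X_display already holds the labels for the same rows.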

    X.shape 

    X.shape is (32561, 12). The printout reproduced here shows string labels and lowercase column names, which matches neither the integer codes nor the column names (Age, Workclass, ..., Country) that the source above produces for X, so read it as an illustration of the rows rather than as literal output of X.

           age         workclass  ...  hours-per-week  native-country
    0       39         State-gov  ...              40   United-States
    1       50  Self-emp-not-inc  ...              13   United-States
    2       38           Private  ...              40   United-States
    3       53           Private  ...              40   United-States
    4       28           Private  ...              40            Cuba
    ...    ...               ...  ...             ...             ...
    32556   27           Private  ...              38   United-States
    32557   40           Private  ...              40   United-States
    32558   58           Private  ...              40   United-States
    32559   22           Private  ...              20   United-States
    32560   52      Self-emp-inc  ...              40   United-States

    [32561 rows x 12 columns]

    First ten rows, all columns:

       age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
    0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
    1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
    2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
    3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
    4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
    5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
    6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
    7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
    8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
    9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

    X_display.shape 

    X_display.shape is (32561, 12).

           age         workclass  ...  hours-per-week  native-country
    0       39         State-gov  ...              40   United-States
    1       50  Self-emp-not-inc  ...              13   United-States
    2       38           Private  ...              40   United-States
    3       53           Private  ...              40   United-States
    4       28           Private  ...              40            Cuba
    ...    ...               ...  ...             ...             ...
    32556   27           Private  ...              38   United-States
    32557   40           Private  ...              40   United-States
    32558   58           Private  ...              40   United-States
    32559   22           Private  ...              20   United-States
    32560   52      Self-emp-inc  ...              40   United-States

    [32561 rows x 12 columns]

    First ten rows, all columns:

       age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
    0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
    1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
    2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
    3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
    4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
    5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
    6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
    7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
    8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
    9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States
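
    Because the two frames have the same shape, comparing shapes alone does not show where they differ. A small check on dtypes does. The helper below, differing_columns, is a hypothetical name introduced for this sketch, and the two frames are tiny stand-ins built by hand rather than the real return values.

```python
import pandas as pd


def differing_columns(X, X_display):
    """List the columns whose dtype differs between the two frames."""
    return [col for col in X.columns
            if str(X[col].dtype) != str(X_display[col].dtype)]


# Stand-in for X_display: one numeric and one categorical column.
X_display = pd.DataFrame({
    "Age": pd.Series([39.0, 50.0], dtype="float32"),
    "Workclass": pd.Series([" State-gov", " Private"], dtype="category"),
})

# Stand-in for X: same rows, categorical column replaced by its codes.
X = X_display.copy()
X["Workclass"] = X["Workclass"].cat.codes

print(X.shape == X_display.shape)        # True
print(differing_columns(X, X_display))   # ['Workclass']
```

    The same helper can be pointed at the real X and X_display once they have been loaded.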

  • Original article: https://blog.csdn.net/qq_41185868/article/details/125611687