• Data analysis case study 1: analyzing red and white wine data with Python



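    Using the red wine data as the example

    There are two datasets:
    winequality-red.csv: red wine samples
    winequality-white.csv: white wine samples
    Each sample has a quality score on a scale from 0 to 10, plus the results of several physicochemical tests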

    # Physicochemical field names
    1. fixed acidity
    2. volatile acidity
    3. citric acid
    4. residual sugar
    5. chlorides
    6. free sulfur dioxide
    7. total sulfur dioxide
    8. density
    9. pH
    10. sulphates
    11. alcohol
    12. quality

    Importing the libraries and loading the data

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    # import seaborn as sns
    %matplotlib inline
    plt.style.use('ggplot')
    
    # sep defaults to a comma, so the semicolon delimiter must be passed explicitly
    red_df = pd.read_csv('winequality-red.csv', sep=';')
    white_df = pd.read_csv('winequality-white.csv', sep=';')
    
    # View the first rows
    red_df.head()
    
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur-dioxide  density    pH  sulphates  alcohol  quality
    0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
    1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
    2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
    3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
    4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5

    Renaming a column

    The column name total_sulfur-dioxide mixes a hyphen with underscores, unlike the other columns, so rename it:

    red_df.rename(columns={"total_sulfur-dioxide":"total_sulfur_dioxide"}, inplace=True)
    
    # Check that the rename worked
    red_df.head(5)
    
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  quality
    0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
    1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
    2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
    3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
    4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
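
    When more than one header is inconsistent, renaming columns one at a time gets tedious. The sketch below is only an illustration on a made-up DataFrame (demo_df and its values are assumptions, not data read from the CSV files); it normalizes every column name in one pass with the string methods available on df.columns.

```python
import pandas as pd

# Hypothetical frame whose headers mix spaces, hyphens and underscores
demo_df = pd.DataFrame(
    [[7.4, 34.0, 3.51]],
    columns=["fixed acidity", "total_sulfur-dioxide", "pH"],
)

# Replace every space or hyphen in the column names with an underscore
demo_df.columns = demo_df.columns.str.replace(r"[ -]", "_", regex=True)

print(list(demo_df.columns))
```

    The same one-liner could be applied to red_df and white_df right after loading, which would make the single rename above unnecessary.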

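    Answer the following questions

    • Number of samples in each dataset
    • Number of columns in each dataset
    • Features with missing values
    • Duplicate rows in the red wine dataset
    • Number of unique quality levels in the dataset
    • Mean density of the red wine dataset

    # View basic information
    red_df.info()
    
    <class 'pandas.core.frame.DataFrame'>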
    RangeIndex: 1599 entries, 0 to 1598
    Data columns (total 12 columns):
    fixed_acidity           1599 non-null float64
    volatile_acidity        1599 non-null float64
    citric_acid             1599 non-null float64
    residual_sugar          1599 non-null float64
    chlorides               1599 non-null float64
    free_sulfur_dioxide     1599 non-null float64
    total_sulfur_dioxide    1599 non-null float64
    density                 1599 non-null float64
    pH                      1599 non-null float64
    sulphates               1599 non-null float64
    alcohol                 1599 non-null float64
    quality                 1599 non-null int64
    dtypes: float64(11), int64(1)
    memory usage: 150.0 KB
    
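    # Number of samples
    len(red_df)
    
    1599
    
    # Number of columns in the dataset
    len(red_df.columns)
    
    12
    
    # Number of duplicate rows in the red wine data
    sum(red_df.duplicated())
    
    240
    
    # Number of unique quality values
    len(red_df['quality'].unique())
    
    6
    
    # Mean density of the red wine dataset
    red_df['density'].mean()
    
    0.9967466791744833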
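
    The info() output answers the missing-value question indirectly: every column reports 1599 non-null entries, so nothing is missing. A more direct check is sketched below on a small made-up DataFrame (demo_df and its values are assumptions for illustration), using isnull() and duplicated(), both standard pandas methods.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and one duplicated row
demo_df = pd.DataFrame({
    "pH": [3.51, 3.20, np.nan, 3.20],
    "quality": [5, 6, 6, 6],
})

# Number of missing values in each column
missing_per_column = demo_df.isnull().sum()

# Names of the columns that contain at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(demo_df.duplicated().sum())

print(missing_per_column)
print(columns_with_missing, duplicate_rows)
```

    Run against red_df, the first expression would be expected to show zeros for every column, in line with the info() output above.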
    

    Combining the two datasets

    # Combine the red and white wine data
    
    # Create a color array for the red wine DataFrame (one label per row)
    color_red = np.repeat("red", red_df.shape[0])
    
    # Create a color array for the white wine DataFrame
    color_white = np.repeat("white", white_df.shape[0])
    
    len(color_red)
    
    1599
    
    red_df['color'] = color_red
    
    # Inspect the new column to confirm it was added
    red_df.head()
    
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  quality  color
    0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    red
    1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5    red
    2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5    red
    3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6    red
    4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    red
    
    white_df["color"] = color_white
    white_df.head()
    
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  quality  color
    0            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6  white
    1            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6  white
    2            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6  white
    3            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6  white
    4            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6  white
    
    print(len(red_df))
    print(len(white_df))
    
    1599
    4898
    
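    # Append the white wine rows to the red wine rows
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    wine_df = pd.concat([red_df, white_df])
    
    # Inspect the result to check that it worked
    wine_df.head()
    
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  quality  color
    0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    red
    1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5    red
    2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5    red
    3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6    red
    4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    red
    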
    wine_df.shape
    
    (6497, 13)
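
    The labeling and stacking steps can also be written more compactly. The following is a minimal sketch on two tiny made-up frames (red_demo and white_demo are stand-ins, not the real data): assign() adds the color column by broadcasting a scalar, and pd.concat stacks the frames with a fresh row index.

```python
import pandas as pd

# Tiny stand-ins for red_df and white_df (values are made up)
red_demo = pd.DataFrame({"pH": [3.51, 3.20], "quality": [5, 5]})
white_demo = pd.DataFrame({"pH": [3.00, 3.30, 3.26], "quality": [6, 6, 6]})

# assign() returns a copy with the new column; the scalar is repeated for every row
combined = pd.concat(
    [red_demo.assign(color="red"), white_demo.assign(color="white")],
    ignore_index=True,  # renumber rows 0..n-1 instead of restarting at 0 for white
)

print(combined.shape)
print(combined["color"].value_counts())
```

    Without ignore_index=True the row labels of the second frame restart at 0, which is harmless here because the data is written to CSV without its index, but it can cause surprises when selecting rows by label.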
    

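    Saving the combined dataset

    # Save the combined dataset
    wine_df.to_csv("winequality_edited.csv",index=False)
    
    # Set the seaborn style (left disabled because seaborn is not imported)
    # sns.set_style("ticks")
    # Reload the saved file
    wine_df = pd.read_csv("winequality_edited.csv")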
    
    wine_df.shape
    
    (6497, 13)
    

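    Visual exploration

    • Based on histograms of the columns in this dataset, which of the following feature variables appear right-skewed? Fixed acidity, total sulfur dioxide, pH, alcohol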


    Plotting histograms

    fig, axs = plt.subplots(2, 2, figsize=(8, 8))
    
    # _ is a throwaway name for the returned Axes object, which is not needed
    _ = wine_df.fixed_acidity.plot.hist(ax=axs[0][0], rwidth=0.9)
    _ = wine_df.total_sulfur_dioxide.plot.hist(ax=axs[0][1], rwidth=0.9)
    _ = wine_df.pH.plot.hist(ax=axs[1][0], rwidth=0.9)
    _ = wine_df.alcohol.plot.hist(ax=axs[1][1], rwidth=0.9)
    

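    [Figure: histograms of fixed_acidity, total_sulfur_dioxide, pH, and alcohol]

    Judging skewness

    The figure below shows, in order, a left-skewed, a normal, and a right-skewed distribution

    [Figure: left-skewed, normal, and right-skewed distributions]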

    # numeric_only=True skips the non-numeric color column
    wine_df.skew(axis=0, numeric_only=True)
    
    fixed_acidity           1.723290
    volatile_acidity        1.495097
    citric_acid             0.471731
    residual_sugar          1.435404
    chlorides               5.399828
    free_sulfur_dioxide     1.220066
    total_sulfur_dioxide   -0.001177
    density                 0.503602
    pH                      0.386839
    sulphates               1.797270
    alcohol                 0.565718
    quality                 0.189623
    dtype: float64
    

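    A positive skewness value indicates a right-skewed distribution, so fixed_acidity, pH, and alcohol are all right-skewed; total_sulfur_dioxide, at about -0.001, is close to symmetric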
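
    To make the sign convention concrete, the sketch below computes skewness by hand on a made-up sample with a long right tail (the values are illustrative assumptions). It uses the plain moment-based definition, the third central moment divided by the second central moment to the power 1.5, and compares it with Series.skew(), which applies a small-sample bias correction.

```python
import pandas as pd

# Made-up sample with a long tail on the right
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 12.0])

# Moment-based skewness: m3 / m2 ** 1.5, using central moments
centered = values - values.mean()
m2 = (centered ** 2).mean()
m3 = (centered ** 3).mean()
moment_skew = m3 / m2 ** 1.5

# pandas applies a bias correction, so the value differs slightly
# from the moment-based one, but the sign is the same
pandas_skew = values.skew()

print(moment_skew, pandas_skew)
# In a right-skewed sample the tail pulls the mean above the median
print(values.mean() > values.median())
```

    Both numbers come out positive for this sample, and the mean exceeds the median, which is the usual signature of a right-skewed distribution.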

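    • Based on scatter plots of quality against different feature variables, which of the following is most likely to have a positive effect on quality? Volatile acidity, residual sugar, pH, alcohol

    Note that the code below plots fixed_acidity and total_sulfur_dioxide in place of volatile acidity and residual sugar.
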
    x = wine_df[["fixed_acidity", "total_sulfur_dioxide", "pH", "alcohol", "quality"]]
    
    fig, axs = plt.subplots(2, 2, figsize=(12, 8))
    
    _  = x.plot.scatter(y='fixed_acidity', x='quality', ax=axs[0][0], linewidths=0.001, marker='o')
    _  = x.plot.scatter(y='total_sulfur_dioxide', x='quality', ax=axs[0][1], linewidths=0.001, marker='o')
    _  = x.plot.scatter(y='pH', x='quality', ax=axs[1][0], linewidths=0.001, marker='o')
    _  = x.plot.scatter(y='alcohol', x='quality', ax=axs[1][1], linewidths=0.001, marker='o')
    
    # sns.despine()
    


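    The trend is hard to read from the plots, so quantify it instead by computing the correlation coefficient between each feature and quality. The larger the positive coefficient, the stronger the positive linear association with quality

    Correlation coefficients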

    sub_df = wine_df.iloc[:,np.r_[0,6,8,10,11]]
    sub_df.corr()['quality']
    
    fixed_acidity          -0.076743
    total_sulfur_dioxide   -0.041385
    pH                      0.019506
    alcohol                 0.444319
    quality                 1.000000
    Name: quality, dtype: float64
    

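    alcohol has the largest correlation coefficient (about 0.44), so among these features it shows the strongest positive association with quality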
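
    For readers who want to see what corr() computes, here is a minimal sketch of the Pearson coefficient worked out by hand on a made-up sample (demo_df and its values are assumptions for illustration) and compared with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up values: quality tends to rise with alcohol in this toy sample
demo_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "quality": [5, 5, 6, 6, 7, 6],
})

x = demo_df["alcohol"].to_numpy()
y = demo_df["quality"].to_numpy()

# Pearson r: sum of products of deviations, divided by the square root
# of the product of the two sums of squared deviations
x_centered = x - x.mean()
y_centered = y - y.mean()
manual_r = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# DataFrame.corr() uses the Pearson coefficient by default
pandas_r = demo_df.corr().loc["alcohol", "quality"]

print(round(manual_r, 6), round(pandas_r, 6))
```

    The coefficient only measures linear association. A value near 0.44, as found for alcohol, indicates a moderate relationship and does not by itself show that alcohol causes higher ratings.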

    Viewing the mean values

    # numeric_only=True skips the non-numeric color column
    wine_df.mean(numeric_only=True)
    
    fixed_acidity             7.215307
    volatile_acidity          0.339666
    citric_acid               0.318633
    residual_sugar            5.443235
    chlorides                 0.056034
    free_sulfur_dioxide      30.525319
    total_sulfur_dioxide    115.744574
    density                   0.994697
    pH                        3.218501
    sulphates                 0.531268
    alcohol                  10.491801
    quality                   5.818378
    dtype: float64
    

    Grouping by attributes

    # Group by quality and view the mean of each group
    # (numeric_only=True skips the non-numeric color column)
    wine_df.groupby('quality').mean(numeric_only=True)
    
             fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide   density        pH  sulphates    alcohol
    quality
    3             7.853333          0.517000     0.281000        5.140000   0.077033            39.216667            122.033333  0.995744  3.257667   0.506333  10.215000
    4             7.288889          0.457963     0.272315        4.153704   0.060056            20.636574            103.432870  0.994833  3.231620   0.505648  10.180093
    5             7.326801          0.389614     0.307722        5.804116   0.064666            30.237371            120.839102  0.995849  3.212189   0.526403   9.837783
    6             7.177257          0.313863     0.323583        5.549753   0.054157            31.165021            115.410790  0.994558  3.217726   0.532549  10.587553
    7             7.128962          0.288800     0.334764        4.731696   0.045272            30.422150            108.498610  0.993126  3.228072   0.547025  11.386006
    8             6.835233          0.291010     0.332539        5.382902   0.041124            34.533679            117.518135  0.992514  3.223212   0.512487  11.678756
    9             7.420000          0.298000     0.386000        4.120000   0.027400            33.400000            116.000000  0.991460  3.308000   0.466000  12.180000
    
    # Group by quality and color as a two-level index and view the means
    wine_df.groupby(['quality','color']).mean()
    
                   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide   density        pH  sulphates    alcohol
    quality color
    3       red         8.360000          0.884500     0.171000        2.635000   0.122500            11.000000             24.900000  0.997464  3.398000   0.570000   9.955000
            white       7.600000          0.333250     0.336000        6.392500   0.054300            53.325000            170.600000  0.994884  3.187500   0.474500  10.345000
    4       red         7.779245          0.693962     0.174151        2.694340   0.090679            12.264151             36.245283  0.996542  3.381509   0.596415  10.265094
            white       7.129448          0.381227     0.304233        4.628221   0.050098            23.358896            125.279141  0.994277  3.182883   0.476135  10.152454
    5       red         8.167254          0.577041     0.243686        2.528855   0.092736            16.983847             56.513950  0.997104  3.304949   0.620969   9.899706
            white       6.933974          0.302011     0.337653        7.334969   0.051546            36.432052            150.904598  0.995263  3.168833   0.482203   9.808840
    6       red         8.347179          0.497484     0.273824        2.477194   0.084956            15.711599             40.869906  0.996615  3.318072   0.675329  10.629519
            white       6.837671          0.260564     0.338025        6.441606   0.045217            35.650591            137.047316  0.993961  3.188599   0.491106  10.575372
    7       red         8.872362          0.403920     0.375176        2.720603   0.076588            14.045226             35.020101  0.996104  3.290754   0.741256  11.465913
            white       6.734716          0.262767     0.325625        5.186477   0.038191            34.125568            125.114773  0.992452  3.213898   0.503102  11.367936
    8       red         8.566667          0.423333     0.391111        2.577778   0.068444            13.277778             33.444444  0.995212  3.267222   0.767778  12.094444
            white       6.657143          0.277400     0.326514        5.671429   0.038314            36.720000            126.165714  0.992236  3.218686   0.486229  11.636000
    9       white       7.420000          0.298000     0.386000        4.120000   0.027400            33.400000            116.000000  0.991460  3.308000   0.466000  12.180000
    
    # Keep the grouping keys as ordinary columns instead of the index
    wine_df.groupby(['quality','color'], as_index=False).mean()
    
        quality  color  fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide   density        pH  sulphates    alcohol
     0        3    red       8.360000          0.884500     0.171000        2.635000   0.122500            11.000000             24.900000  0.997464  3.398000   0.570000   9.955000
     1        3  white       7.600000          0.333250     0.336000        6.392500   0.054300            53.325000            170.600000  0.994884  3.187500   0.474500  10.345000
     2        4    red       7.779245          0.693962     0.174151        2.694340   0.090679            12.264151             36.245283  0.996542  3.381509   0.596415  10.265094
     3        4  white       7.129448          0.381227     0.304233        4.628221   0.050098            23.358896            125.279141  0.994277  3.182883   0.476135  10.152454
     4        5    red       8.167254          0.577041     0.243686        2.528855   0.092736            16.983847             56.513950  0.997104  3.304949   0.620969   9.899706
     5        5  white       6.933974          0.302011     0.337653        7.334969   0.051546            36.432052            150.904598  0.995263  3.168833   0.482203   9.808840
     6        6    red       8.347179          0.497484     0.273824        2.477194   0.084956            15.711599             40.869906  0.996615  3.318072   0.675329  10.629519
     7        6  white       6.837671          0.260564     0.338025        6.441606   0.045217            35.650591            137.047316  0.993961  3.188599   0.491106  10.575372
     8        7    red       8.872362          0.403920     0.375176        2.720603   0.076588            14.045226             35.020101  0.996104  3.290754   0.741256  11.465913
     9        7  white       6.734716          0.262767     0.325625        5.186477   0.038191            34.125568            125.114773  0.992452  3.213898   0.503102  11.367936
    10        8    red       8.566667          0.423333     0.391111        2.577778   0.068444            13.277778             33.444444  0.995212  3.267222   0.767778  12.094444
    11        8  white       6.657143          0.277400     0.326514        5.671429   0.038314            36.720000            126.165714  0.992236  3.218686   0.486229  11.636000
    12        9  white       7.420000          0.298000     0.386000        4.120000   0.027400            33.400000            116.000000  0.991460  3.308000   0.466000  12.180000
    
    # View only the pH column after grouping
    wine_df.groupby(['quality','color'], as_index=False)['pH'].mean()
    
        quality  color        pH
     0        3    red  3.398000
     1        3  white  3.187500
     2        4    red  3.381509
     3        4  white  3.182883
     4        5    red  3.304949
     5        5  white  3.168833
     6        6    red  3.318072
     7        6  white  3.188599
     8        7    red  3.290754
     9        7  white  3.213898
    10        8    red  3.267222
    11        8  white  3.218686
    12        9  white  3.308000

    Question 1: Does one type of wine (red or white) tend to have higher quality?

    # Use groupby to compute the mean quality for each wine type (red and white)
    wine_df.groupby("color")["quality"].mean()
    
    color
    red      5.636023
    white    5.877909
    Name: quality, dtype: float64
    

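    White wine has a slightly higher average quality than red wine (5.88 versus 5.64)

    Which acidity level has the highest average rating?

    # Use the pandas describe method to view the min, 25%, 50%, 75%, and max pH values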
    wine_df.pH.describe()
    
    count    6497.000000
    mean        3.218501
    std         0.160787
    min         2.720000
    25%         3.110000
    50%         3.210000
    75%         3.320000
    max         4.010000
    Name: pH, dtype: float64
    
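    # Bin edges used to split the data into groups
    bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01] # the five values computed above
    
    # Labels for the four acidity level groups
    # (lower pH means higher acidity, so the first bin is labeled "high")
    bin_names = [ "high", "median_high", "mediam", "low"] # one name per acidity level category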
    
    help(pd.cut)
    
    Help on function cut in module pandas.core.reshape.tile:
    
    cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
        Bin values into discrete intervals.
        
        Use `cut` when you need to segment and sort data values into bins. This
        function is also useful for going from a continuous variable to a
        categorical variable. For example, `cut` could convert ages to groups of
        age ranges. Supports binning into an equal number of bins, or a
        pre-specified array of bins.
        
        Parameters
        ----------
        x : array-like
            The input array to be binned. Must be 1-dimensional.
        bins : int, sequence of scalars, or pandas.IntervalIndex
            The criteria to bin by.
        
            * int : Defines the number of equal-width bins in the range of `x`. The
              range of `x` is extended by .1% on each side to include the minimum
              and maximum values of `x`.
            * sequence of scalars : Defines the bin edges allowing for non-uniform
              width. No extension of the range of `x` is done.
            * IntervalIndex : Defines the exact bins to be used.
        
        right : bool, default True
            Indicates whether `bins` includes the rightmost edge or not. If
            ``right == True`` (the default), then the `bins` ``[1, 2, 3, 4]``
            indicate (1,2], (2,3], (3,4]. This argument is ignored when
            `bins` is an IntervalIndex.
        labels : array or bool, optional
            Specifies the labels for the returned bins. Must be the same length as
            the resulting bins. If False, returns only integer indicators of the
            bins. This affects the type of the output container (see below).
            This argument is ignored when `bins` is an IntervalIndex.
        retbins : bool, default False
            Whether to return the bins or not. Useful when bins is provided
            as a scalar.
        precision : int, default 3
            The precision at which to store and display the bins labels.
        include_lowest : bool, default False
            Whether the first interval should be left-inclusive or not.
        duplicates : {default 'raise', 'drop'}, optional
            If bin edges are not unique, raise ValueError or drop non-uniques.
        
            .. versionadded:: 0.23.0
        
        Returns
        -------
        out : pandas.Categorical, Series, or ndarray
            An array-like object representing the respective bin for each value
            of `x`. The type depends on the value of `labels`.
        
            * True (default) : returns a Series for Series `x` or a
              pandas.Categorical for all other inputs. The values stored within
              are Interval dtype.
        
            * sequence of scalars : returns a Series for Series `x` or a
              pandas.Categorical for all other inputs. The values stored within
              are whatever the type in the sequence is.
        
            * False : returns an ndarray of integers.
        
        bins : numpy.ndarray or IntervalIndex.
            The computed or specified bins. Only returned when `retbins=True`.
            For scalar or sequence `bins`, this is an ndarray with the computed
            bins. If set `duplicates=drop`, `bins` will drop non-unique bin. For
            an IntervalIndex `bins`, this is equal to `bins`.
        
        See Also
        --------
        qcut : Discretize variable into equal-sized buckets based on rank
            or based on sample quantiles.
        pandas.Categorical : Array type for storing data that come from a
            fixed set of values.
        Series : One-dimensional array with axis labels (including time series).
        pandas.IntervalIndex : Immutable Index implementing an ordered,
            sliceable set.
        
        Notes
        -----
        Any NA values will be NA in the result. Out of bounds values will be NA in
        the resulting Series or pandas.Categorical object.
        
        Examples
        --------
        Discretize into three equal-sized bins.
        
        >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
        ... # doctest: +ELLIPSIS
        [(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
        Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
        
        >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
        ... # doctest: +ELLIPSIS
        ([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
        Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
        array([0.994, 3.   , 5.   , 7.   ]))
        
        Discovers the same bins, but assign them specific labels. Notice that
        the returned Categorical's categories are `labels` and is ordered.
        
        >>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
        ...        3, labels=["bad", "medium", "good"])
        [bad, good, medium, medium, good, bad]
        Categories (3, object): [bad < medium < good]
        
        ``labels=False`` implies you just want the bins back.
        
        >>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
        array([0, 1, 1, 3])
        
        Passing a Series as an input returns a Series with categorical dtype:
        
        >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
        ...               index=['a', 'b', 'c', 'd', 'e'])
        >>> pd.cut(s, 3)
        ... # doctest: +ELLIPSIS
        a    (1.992, 4.667]
        b    (1.992, 4.667]
        c    (4.667, 7.333]
        d     (7.333, 10.0]
        e     (7.333, 10.0]
        dtype: category
        Categories (3, interval[float64]): [(1.992, 4.667] < (4.667, ...
        
        Passing a Series as an input returns a Series with mapping value.
        It is used to map numerically to intervals based on bins.
        
        >>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
        ...               index=['a', 'b', 'c', 'd', 'e'])
        >>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
        ... # doctest: +ELLIPSIS
        (a    0.0
         b    1.0
         c    2.0
         d    3.0
         e    4.0
         dtype: float64, array([0, 2, 4, 6, 8]))
        
        Use `drop` optional when bins is not unique
        
        >>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
        ...    right=False, duplicates='drop')
        ... # doctest: +ELLIPSIS
        (a    0.0
         b    1.0
         c    2.0
         d    3.0
         e    3.0
         dtype: float64, array([0, 2, 4, 6, 8]))
        
        Passing an IntervalIndex for `bins` results in those categories exactly.
        Notice that values not covered by the IntervalIndex are set to NaN. 0
        is to the left of the first bin (which is closed on the right), and 1.5
        falls between two bins.
        
        >>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
        >>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
        [NaN, (0, 1], NaN, (2, 3], (4, 5]]
        Categories (3, interval[int64]): [(0, 1] < (2, 3] < (4, 5]]
    

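    # Create the acidity_levels column
    # Note: with the default include_lowest=False the first bin is (2.72, 3.11],
    # so a sample whose pH equals the minimum 2.72 is left unassigned (NaN)
    wine_df['acidity_levels'] = pd.cut(wine_df['pH'], bin_edges, labels=bin_names)
    
    # Check that the column was created
    wine_df.head()
    
       fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  alcohol  quality  color  acidity_levels
    0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    red             low
    1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5    red     median_high
    2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5    red          mediam
    3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6    red     median_high
    4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5    red             low
    
    # Use groupby to compute the mean quality for each acidity level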
    wine_df.groupby("acidity_levels")['quality'].mean()
    
    acidity_levels
    high           5.783343
    median_high    5.784540
    mediam         5.850832
    low            5.859593
    Name: quality, dtype: float64
    

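    Lower acidity goes with a slightly higher average quality score, although the gap between the highest and lowest groups is under 0.1 points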
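
    One detail of pd.cut is easy to miss: bins are open on the left by default, so a value equal to the lowest edge falls outside every bin. The sketch below shows the effect on a few made-up pH values and how include_lowest=True (a documented pd.cut parameter, see the help text above) closes the first bin. The label names used here are illustrative and differ slightly from the ones in the analysis.

```python
import pandas as pd

# Made-up pH values that include both outer bin edges
ph = pd.Series([2.72, 3.00, 3.11, 3.25, 4.01])
edges = [2.72, 3.11, 3.21, 3.32, 4.01]
labels = ["high", "medium_high", "medium", "low"]  # illustrative label names

# Default: bins are (2.72, 3.11], (3.11, 3.21], ... so 2.72 itself is left out
default_cut = pd.cut(ph, edges, labels=labels)

# include_lowest=True closes the first bin on the left as well
inclusive_cut = pd.cut(ph, edges, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

    In the wine data this only affects samples whose pH equals the minimum of 2.72, so the group means reported above would change very little, but the unassigned rows are silently dropped from the groupby.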

    # Save the changes for the next section
    wine_df.to_csv('winequality_edited_al.csv', index=False)
    

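    Do wines with higher alcohol content receive higher ratings?

    # Median alcohol content
    alcohol_median = wine_df.alcohol.median()
    
    wine_df.head();
    
    # Select samples with alcohol content below the median
    low_alcohol = wine_df.query("alcohol < @alcohol_median")
    
    # Select samples with alcohol content at or above the median
    high_alcohol = wine_df.query("alcohol >= @alcohol_median")
    
    # Mean quality rating of the low-alcohol and high-alcohol groups
    print("Low alcohol:", low_alcohol.quality.mean())
    print("High alcohol:", high_alcohol.quality.mean())
    
    Low alcohol: 5.475920679886686
    High alcohol: 6.146084337349397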
    

    The high-alcohol group has the higher average quality rating (6.15 versus 5.48)

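    Do sweeter wines receive higher ratings?

    # Median residual sugar
    sugar_median = wine_df["residual_sugar"].median()
    
    # Select samples with residual sugar below the median
    low_sugar = wine_df.query("residual_sugar < @sugar_median")
    
    # Select samples with residual sugar at or above the median
    high_sugar = wine_df.query("residual_sugar >= @sugar_median")
    
    # Make sure every sample appears in exactly one of the two queries
    num_samples = wine_df.shape[0]
    num_samples == low_sugar['quality'].count() + high_sugar['quality'].count() # should be True
    
    True
    
    # Mean quality rating of the low-sugar and high-sugar groups
    print("High sugar quality rating:", high_sugar.quality.mean())
    print("Low sugar quality rating:", low_sugar.quality.mean())
    
    High sugar quality rating: 5.82782874617737
    Low sugar quality rating: 5.808800743724822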
    

    Sweeter wines score marginally higher on average (5.83 versus 5.81), a difference of only about 0.02 points
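
    The alcohol and sugar comparisons follow the same pattern: split at the median, then compare mean quality. A small helper can capture that pattern. The function name compare_by_median and the demo data below are hypothetical, introduced only for this sketch; boolean masks are used in place of query(), which gives the same split.

```python
import pandas as pd


def compare_by_median(df, feature, target="quality"):
    """Return the mean of `target` below, and at or above, the median of `feature`."""
    median = df[feature].median()
    low_mean = df.loc[df[feature] < median, target].mean()
    high_mean = df.loc[df[feature] >= median, target].mean()
    return low_mean, high_mean


# Made-up sample for illustration only
demo_df = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 13.0],
    "quality": [5, 5, 6, 6, 7, 7],
})

low_mean, high_mean = compare_by_median(demo_df, "alcohol")
print(low_mean, high_mean)
```

    Called as compare_by_median(wine_df, "residual_sugar"), the helper would be expected to reproduce the two sugar means printed above.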

    Plots by wine type and quality


    First, look at the mean quality of the two wine types

    colors = ['red','white']
    color_means = wine_df.groupby('color')['quality'].mean()
    color_means.plot(kind='bar', title='Average Wine Quality by Color', color=colors, alpha=.8)
    plt.xlabel('colors', fontsize=18);
    plt.ylabel('Quality', fontsize=18);
    

    [Figure: bar chart of average wine quality by color]

    Next, look at the counts grouped by quality and color

    counts = wine_df.groupby(['quality', 'color']).count()['pH']
    counts.plot(kind='bar', title='Counts by Wine Color and quality', color=counts.index.get_level_values(1), alpha=.7)
    plt.xlabel('Quality and Color', fontsize=18)
    plt.ylabel('Count', fontsize=18)
    
    Text(0, 0.5, 'Count')
    

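    [Figure: bar chart of counts by quality and color]

    The red and white sample sizes differ considerably (1599 versus 4898), so proportions give a fairer comparison than raw counts.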

    totals = wine_df.groupby('color').count()['pH']
    counts = wine_df.groupby(['quality', 'color']).count()['pH']
    proportions = counts / totals
    proportions.plot(kind='bar', title='Proportions by Wine Color and quality',color=counts.index.get_level_values(1), alpha=.7)
    plt.xlabel('Quality and Color', fontsize=18)
    plt.ylabel('Proportions', fontsize=18)
    
    Text(0, 0.5, 'Proportions')
    

    [Figure: bar chart of proportions by quality and color]
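
    The within-color proportions can also be obtained in a single step. The sketch below, on made-up data (demo_df is an assumption for illustration), uses value_counts(normalize=True) on a grouped column so that each color's shares sum to 1. It is an alternative route, not the computation used for the plot above.

```python
import pandas as pd

# Made-up sample: 2 red wines and 4 white wines
demo_df = pd.DataFrame({
    "color": ["red", "red", "white", "white", "white", "white"],
    "quality": [5, 6, 5, 6, 6, 7],
})

# Share of each quality level within its own color group
proportions = (
    demo_df.groupby("color")["quality"]
    .value_counts(normalize=True)
    .sort_index()
)

print(proportions)
```

    The result is indexed by (color, quality), so proportions.loc["red"] gives the distribution of quality levels among red wines only.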

    Creating bar charts with Matplotlib

    pyplot's bar function takes two required arguments: the x coordinates of the bars and the heights of the bars.

    plt.bar([1, 2, 3], [224, 620, 425], color='blue');
    

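    [Figure: bar chart with three blue bars]

    The x-axis tick labels can be set either with pyplot's xticks function or by passing an extra argument to bar. The two cells below produce the same result.

    # Draw the bars
    plt.bar([1, 2, 3], [224, 620, 425])
    
    # Set the x-axis tick positions and their labels
    plt.xticks([1, 2, 3], ['a', 'b', 'c']);
    

    [Figure: bar chart with tick labels a, b, c]

    # Draw the bars with x-axis tick labels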
    plt.bar([1, 2, 3], [224, 620, 425], tick_label=['a', 'b', 'c']);
    

    [Figure: bar chart with tick labels a, b, c]

    Set the title and axis labels as follows.

    plt.bar([1, 2, 3], [224, 620, 425], tick_label=['a', 'b', 'c'])
    plt.title('Some Title')
    plt.xlabel('Some X Label')
    plt.ylabel('Some Y Label');
    


    [Figure: bar chart with title and axis labels]

    # example
    import matplotlib.pyplot as plt
    import numpy as np
    
    x = np.linspace(0, 1, 10)
    number = 5
    cmap = plt.get_cmap('gnuplot')
    colors = [cmap(i) for i in np.linspace(0, 1, number)]
    
    for i, color in enumerate(colors, start=1):
        plt.plot(x, i * x + i, color=color, label='$y = {i}x + {i}$'.format(i=i))
    plt.legend(loc='best')
    plt.show()
    



  • Original article: https://blog.csdn.net/No_Name_Cao_Ni_Mei/article/details/139348480