• 【Hive】CDPHiveNULL值排序前后的问题


    最近公司在从本地大数据环境迁移到CDP的集群。在跑相同的SQL代码时,两边对比发现数据有问题,然后就开始查找原因:在对数据进行分组排序时,两个环境的NULL值默认排序不同。

    排错

    • 1、首先我们执行相同的SQL,对比了两边SQL的执行计划,
    select id
         , times
         , row_number() over (partition by id order by times desc) as rn
    from (select 1 as id, 1659258809386 as times
          union all
          select 1 as id, null as times
          ) t1;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    CDP Hive的执行计划如下

    Plan optimized by CBO.
    
    Vertex dependency in root stage
    Map 1 <- Union 2 (CONTAINS)
    Map 4 <- Union 2 (CONTAINS)
    Reducer 3 <- Union 2 (SIMPLE_EDGE)
    
    Stage-0
      Fetch Operator
        limit:-1
        Stage-1
          Reducer 3
          File Output Operator [FS_12]
            Select Operator [SEL_10] (rows=2 width=16)
               Output:["_col0","_col1","_col2"]"
              PTF Operator [PTF_9] (rows=2 width=12)
                 Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
                Select Operator [SEL_8] (rows=2 width=12)
                   Output:["_col0","_col1"]"
                <-Union 2 [SIMPLE_EDGE]
                  <-Map 1 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]"
                        Select Operator [SEL_1] (rows=1 width=12)
                           Output:["_col1"]"
                          TableScan [TS_0] (rows=1 width=1)
                  <-Map 4 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]
                        Select Operator [SEL_3] (rows=1 width=8)
                           Output:["_col1"]
                          TableScan [TS_2] (rows=1 width=1)
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37

    本地的执行计划结果如下:

    Plan not optimized by CBO.
    
    Vertex dependency in root stage
    Map 1 <- Union 2 (CONTAINS)
    Map 4 <- Union 2 (CONTAINS)
    Reducer 3 <- Union 2 (SIMPLE_EDGE)
    
    Stage-0
      Fetch Operator
        limit:-1
        Stage-1
          Reducer 3
          File Output Operator [FS_12]
            Select Operator [SEL_10] (rows=2 width=16)
               Output:["_col0","_col1","_col2"]"
              PTF Operator [PTF_9] (rows=2 width=12)
                 Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
                Select Operator [SEL_8] (rows=2 width=12)
                   Output:["_col0","_col1"]"
                <-Union 2 [SIMPLE_EDGE]
                  <-Map 1 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]"
                        Select Operator [SEL_1] (rows=1 width=12)
                           Output:["_col1"]"
                          TableScan [TS_0] (rows=1 width=1)
                  <-Map 4 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]
                        Select Operator [SEL_3] (rows=1 width=8)
                           Output:["_col1"]
                          TableScan [TS_2] (rows=1 width=1)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36

    发现是对NULL值的排序顺序两边不一致
    对比结果如下:

    在这里插入图片描述

    • 2、于是去官网确认默认的配置什么
      本地集群用的是Hive的2.7.3版本,在Hive官网发现ISSUE: HIVE-12994

    SQL standard does not specify the behavior by default. Currently in Hive, null values sort as if lower than any non-null value; that is, NULLS FIRST is the default for ASC order, and NULLS LAST for DESC order.
    即:默认情况下,SQL标准不会指定该行为。目前在Hive中,空值排序似乎低于任何非空值;即ASC顺序为NULLS FIRST, DESC顺序为NULLS LAST
    如下图所示:
    在这里插入图片描述
    但是CDP的Hive默认NULL值排序是NULLS FIRST。

    • 3、解决办法
        1)改动代码在排序语句后加nulls last ,或者使用nvl()函数对空值做处理
        2)修改配置不修改代码:
      CDP中默认·hive.default.nulls.last=true·,如果您需要在集群中永久设置·hive.default.nulls.last=false·
      您可如下设置:
      CM - Hive on tez - 配置 - 搜索hive_hs2_config_safety_valve - 添加hive.default.nulls.last=false - 保存并重启服务。
  • 相关阅读:
    uniapp 车牌号输入 车牌号键盘 新能源车牌键盘 特殊车辆车牌键盘
    AAC算法
    UE4(Unreal Engine4)虚幻引擎视口布局
    mysql 意向锁
    05 需求分析阶段
    02UEc++【打飞艇:无人机运动】
    小车测距避障-通过串口(可蓝牙)控制
    C语言的函数
    LeetCode算法题解|LeetCode435. 无重叠区间、LeetCode763. 划分字母区间、LeetCode56. 合并区间
    【听如子说】-python模块系列-AIS编解码Pyais
  • 原文地址:https://blog.csdn.net/weixin_42804692/article/details/126130321