HIVE built-in function hash() -- source code analysis



    Hive's hash() accepts arguments of many types, and any number of them. How is this implemented under the hood?

    First, let's see how the built-in hash() function is defined in the source.

    The hash built-in is implemented by the class:

    org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash

    1. The initialize method

    @Override
      public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentTypeException {
    
        argumentOIs = arguments;
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
      }
    

    No validation is performed on the arguments at initialization time: the method simply returns an int object inspector. As a result, any arguments, of any type and number, pass the compile phase.

    2. The evaluate method

    private final IntWritable result = new IntWritable();
    
      @Override
      public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object[] fieldValues = new Object[arguments.length];
        // extract the argument values
        for(int i = 0; i < arguments.length; i++) {
          fieldValues[i] = arguments[i].get();
        }
        int r = ObjectInspectorUtils.getBucketHashCode(fieldValues, argumentOIs);
        result.set(r);
        return result;
      }
    

    Notice that the processing stage accepts an arbitrary number of arguments, so hash() can be called with several values at once. The real work is delegated to ObjectInspectorUtils.getBucketHashCode(fieldValues, argumentOIs):

     int r = ObjectInspectorUtils.getBucketHashCode(fieldValues, argumentOIs);
    

    The getBucketHashCode method

    Drilling into this method:

    public static int getBucketHashCode(Object[] bucketFields, ObjectInspector[] bucketFieldInspectors) {
            int hashCode = 0;
            // iterate over the passed-in objects, calling hashCode on each
            for(int i = 0; i < bucketFields.length; ++i) {
                int fieldHash = hashCode(bucketFields[i], bucketFieldInspectors[i]);
                hashCode = 31 * hashCode + fieldHash;
            }
    
            return hashCode;
        }
    

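    The combining loop above is the classic 31-based polynomial hash, the same scheme java.util.Arrays.hashCode and String.hashCode use, except that it starts from 0 instead of 1. Below is a minimal plain-Java sketch of just the combining step, with the per-field hashes assumed to be precomputed; the class and method names are illustrative, not part of Hive:

```java
public class BucketHashSketch {

    // Same combining rule as getBucketHashCode: start at 0 and fold each
    // field's hash in with hashCode = 31 * hashCode + fieldHash.
    public static int combine(int... fieldHashes) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = 31 * hashCode + fieldHash;
        }
        return hashCode;
    }

    public static void main(String[] args) {
        // two INT fields hashing to 1 and 2: 31 * (31 * 0 + 1) + 2 = 33
        System.out.println(combine(1, 2)); // 33
        // a single field hashes to itself
        System.out.println(combine(7));    // 7
    }
}
```

    Because the fold is order-sensitive, hash(a, b) and hash(b, a) generally differ, which is what you want when bucketing on multiple columns.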
    The hashCode method

    getBucketHashCode hashes each passed-in object in turn. So how is hashCode itself written?

    public static int hashCode(Object o, ObjectInspector objIns) {
            if (o == null) {
                return 0;
            } else {
                int r;
                ObjectInspector keyOI;
                int i;
                switch(objIns.getCategory()) {
            // Hive primitive types
                case PRIMITIVE:
                    PrimitiveObjectInspector poi = (PrimitiveObjectInspector)objIns;
                    long a;
                    switch(poi.getPrimitiveCategory()) {
                    case VOID:
                        return 0;
                    case BOOLEAN:
                        return ((BooleanObjectInspector)poi).get(o) ? 1 : 0;
                    case BYTE:
                        return ((ByteObjectInspector)poi).get(o);
                    case SHORT:
                        return ((ShortObjectInspector)poi).get(o);
                    case INT:
                        return ((IntObjectInspector)poi).get(o);
                    case LONG:
                        a = ((LongObjectInspector)poi).get(o);
                        return (int)(a >>> 32 ^ a);
                    case FLOAT:
                        return Float.floatToIntBits(((FloatObjectInspector)poi).get(o));
                    case DOUBLE:
                        a = Double.doubleToLongBits(((DoubleObjectInspector)poi).get(o));
                        return (int)(a >>> 32 ^ a);
                    case STRING:
                        Text t = ((StringObjectInspector)poi).getPrimitiveWritableObject(o);
                        r = 0; // reuse the r declared above; redeclaring it here would not compile
    
                        for(i = 0; i < t.getLength(); ++i) {
                            r = r * 31 + t.getBytes()[i];
                        }
    
                        return r;
                    case CHAR:
                        return ((HiveCharObjectInspector)poi).getPrimitiveWritableObject(o).hashCode();
                    case VARCHAR:
                        return ((HiveVarcharObjectInspector)poi).getPrimitiveWritableObject(o).hashCode();
                    case BINARY:
                        return ((BinaryObjectInspector)poi).getPrimitiveWritableObject(o).hashCode();
                    case DATE:
                        return ((DateObjectInspector)poi).getPrimitiveWritableObject(o).hashCode();
                    case TIMESTAMP:
                        TimestampWritable ts = ((TimestampObjectInspector)poi).getPrimitiveWritableObject(o); // renamed from t: the Text t above is still in scope
                        return ts.hashCode();
                    case INTERVAL_YEAR_MONTH:
                        HiveIntervalYearMonthWritable intervalYearMonth = ((HiveIntervalYearMonthObjectInspector)poi).getPrimitiveWritableObject(o);
                        return intervalYearMonth.hashCode();
                    case INTERVAL_DAY_TIME:
                        HiveIntervalDayTimeWritable intervalDayTime = ((HiveIntervalDayTimeObjectInspector)poi).getPrimitiveWritableObject(o);
                        return intervalDayTime.hashCode();
                    case DECIMAL:
                        return ((HiveDecimalObjectInspector)poi).getPrimitiveWritableObject(o).hashCode();
                    default:
                        throw new RuntimeException("Unknown type: " + poi.getPrimitiveCategory());
                    }
            // list (array) type
                case LIST:
                    r = 0;
                    ListObjectInspector listOI = (ListObjectInspector)objIns;
                    keyOI = listOI.getListElementObjectInspector();
    
                    for(i = 0; i < listOI.getListLength(o); ++i) {
                        r = 31 * r + hashCode(listOI.getListElement(o, i), keyOI);
                    }
    
                    return r;
            // map type
                case MAP:
                    r = 0;
                    MapObjectInspector mapOI = (MapObjectInspector)objIns;
                    keyOI = mapOI.getMapKeyObjectInspector();
                    ObjectInspector valueOI = mapOI.getMapValueObjectInspector();
                    Map<?, ?> map = mapOI.getMap(o);
    
                    Entry entry;
                    for(Iterator var9 = map.entrySet().iterator(); var9.hasNext(); r += hashCode(entry.getKey(), keyOI) ^ hashCode(entry.getValue(), valueOI)) {
                        entry = (Entry)var9.next();
                    }
    
                    return r;
            // struct type
                case STRUCT:
                    r = 0;
                    StructObjectInspector structOI = (StructObjectInspector)objIns;
                    List<? extends StructField> fields = structOI.getAllStructFieldRefs();
    
                    StructField field;
                    for(Iterator var18 = fields.iterator(); var18.hasNext(); r = 31 * r + hashCode(structOI.getStructFieldData(o, field), field.getFieldObjectInspector())) {
                        field = (StructField)var18.next();
                    }
    
                    return r;
            // union type
                case UNION:
                    UnionObjectInspector uOI = (UnionObjectInspector)objIns;
                    byte tag = uOI.getTag(o);
                    return hashCode(uOI.getField(o), (ObjectInspector)uOI.getObjectInspectors().get(tag));
                default:
                    throw new RuntimeException("Unknown type: " + objIns.getTypeName());
                }
            }
        }
    

    As you can see, hashCode computes a type-specific hash based on each argument's object inspector and returns the result.
    The supported categories cover the primitive types, lists, maps, structs, and unions. Passing any other type raises a runtime exception:
    ("Unknown type: " + objIns.getTypeName())
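    A few of the primitive branches can be reproduced in plain Java to see concrete values. This sketch mirrors the LONG, DOUBLE, BOOLEAN and STRING rules from the code above; for ASCII strings, the byte-wise loop over the UTF-8 Text bytes gives the same result as java.lang.String.hashCode. Class and method names here are illustrative only:

```java
import java.nio.charset.StandardCharsets;

public class PrimitiveHashSketch {

    // LONG branch: XOR the two 32-bit halves together
    public static int hashLong(long a) {
        return (int) (a >>> 32 ^ a);
    }

    // DOUBLE branch: hash the IEEE-754 bit pattern like a long
    public static int hashDouble(double d) {
        return hashLong(Double.doubleToLongBits(d));
    }

    // BOOLEAN branch
    public static int hashBoolean(boolean b) {
        return b ? 1 : 0;
    }

    // STRING branch: 31-polynomial over the UTF-8 bytes of the Text value
    public static int hashString(String s) {
        int r = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            r = r * 31 + b;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(hashLong(5L));      // 5: the high half is 0
        System.out.println(hashBoolean(true)); // 1
        // "ab" -> 31 * 97 + 98 = 3105, same as "ab".hashCode()
        System.out.println(hashString("ab"));
    }
}
```

    Note also, in the code above, that the MAP branch accumulates key-hash XOR value-hash with +=, so a map's hash does not depend on entry iteration order, while LIST and STRUCT use the order-sensitive 31-polynomial.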

    TIPS: In Hive, hash() can be applied to values of all these types, but be aware that Spark's corresponding function has some quirks of its own.

  • Original article: https://blog.csdn.net/weixin_46429290/article/details/125610632