• Reading and writing ORC files, and integrating them with Hive


    First, a bit of background.

    Why learn to read and write ORC files by hand at all? When we create a table we simply add "stored as orc", so what is the point of reading and writing the files ourselves?

    1. When using DataX's HdfsReader, the HdfsWriter side is sometimes just too slow. The splitPk approach I described earlier cuts the overall time to some extent, but it does not make the writer itself any faster: if worker A is slow, putting ten As on the job finishes sooner than one, yet each A is still slow.

    2. Understanding how ORC files are read and written makes troubleshooting far easier. For example: a decimal column comes out with the wrong precision, how do we adjust it? The output files are not 128 MB, what do we do?

    3. Better bragging rights.

    Before the code, one caveat: there are two packages that can write ORC files for Hive.

    Note that both of them can write ORC, but there are small differences between the two.

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.orc</groupId>
        <artifactId>orc-core</artifactId>
        <version>1.5.4</version>
    </dependency>

    If the dependencies above do not work, try pulling in hive-exec.jar instead. I have added quite a few jars to this project and can no longer tell exactly which one is doing the work.
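    For reference, the hive-exec coordinates would look roughly like the following; the version number here is only an assumption, pick the one that matches the Hive release on your cluster:

    <!-- assumed version: match it to your Hive installation -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.3.7</version>
    </dependency>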

    The code below runs as-is; you only need to change the paths in the main method.

    package com.chenchi.learning.fileformat.orc;

    /**
     * Dependencies:
     *   org.apache.hadoop : hadoop-client : 2.7.7
     *   org.apache.orc    : orc-core      : 1.5.4
     * -----------------------------------
     * ORC file-level API read/write, extended from the example by 铁头乔 (51CTO blog,
     * copyright belongs to the original author):
     * https://blog.51cto.com/u_15352899/3746656
     */
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.common.type.HiveDecimal;
    import org.apache.hadoop.hive.ql.exec.vector.*;
    import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
    import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
    import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
    import org.apache.orc.*;

    import java.io.File;
    import java.io.IOException;
    import java.math.BigDecimal;
    import java.sql.Timestamp;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.UUID;

    public class ReadAndWriteOrcTest {

        public static void main(String[] args) throws IOException {
            ReadAndWriteOrcTest writeOrc = new ReadAndWriteOrcTest();
            writeOrc.writeOrc("D:\\install\\code\\learning\\bigdata_learining\\src\\main\\resources\\out\\my-file.orc");
            writeOrc.readOrc("D:\\install\\code\\learning\\bigdata_learining\\src\\main\\resources\\out\\my-file.orc");
        }

        private void writeOrc(String path) throws IOException {
            File file = new File(path);
            if (file.exists()) file.delete();
            Configuration conf = new Configuration();
            TypeDescription schema = TypeDescription.createStruct()
                    .addField("long_value", TypeDescription.createLong())
                    .addField("double_value", TypeDescription.createDouble())
                    .addField("boolean_value", TypeDescription.createBoolean())
                    .addField("string_value", TypeDescription.createString())
                    .addField("decimal_value", TypeDescription.createDecimal().withScale(18))
                    .addField("date_value", TypeDescription.createTimestamp())
                    .addField("timestamp_value", TypeDescription.createTimestamp());
            Writer writer = OrcFile.createWriter(new Path(path),
                    OrcFile.writerOptions(conf)
                            .setSchema(schema)
                            .stripeSize(67108864)
                            .bufferSize(64 * 1024)
                            .blockSize(128 * 1024 * 1024)
                            .rowIndexStride(10000)
                            .blockPadding(true)
                            .compress(CompressionKind.ZLIB));
            // Create a batch sized by the column count and the default maximum of 1024 rows.
            VectorizedRowBatch batch = schema.createRowBatch();
            LongColumnVector longVector = (LongColumnVector) batch.cols[0];
            DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[1];
            LongColumnVector booleanVector = (LongColumnVector) batch.cols[2];
            BytesColumnVector stringVector = (BytesColumnVector) batch.cols[3];
            DecimalColumnVector decimalVector = (DecimalColumnVector) batch.cols[4];
            TimestampColumnVector dateVector = (TimestampColumnVector) batch.cols[5];
            TimestampColumnVector timestampVector = (TimestampColumnVector) batch.cols[6];
            for (int r = 0; r < 10; ++r) {
                int row = batch.size++;
                longVector.vector[row] = r;
                doubleVector.vector[row] = r;
                booleanVector.vector[row] = r % 2;
                stringVector.setVal(row, UUID.randomUUID().toString().getBytes());
                BigDecimal bigDecimal = new BigDecimal((double) r / 3).setScale(18, BigDecimal.ROUND_DOWN);
                HiveDecimal hiveDecimal = HiveDecimal.create(bigDecimal).setScale(18);
                decimalVector.set(row, hiveDecimal);
                long time = new Date().getTime();
                Timestamp timestamp = new Timestamp(time);
                dateVector.set(row, timestamp);
                timestampVector.set(row, timestamp);
                if (batch.size == batch.getMaxSize()) {
                    writer.addRowBatch(batch);
                    batch.reset();
                }
            }
            if (batch.size != 0) {
                writer.addRowBatch(batch);
                batch.reset();
            }
            writer.close();
        }

        private void readOrc(String path) throws IOException {
            Configuration conf = new Configuration();
            TypeDescription readSchema = TypeDescription.createStruct()
                    .addField("long_value", TypeDescription.createLong())
                    .addField("double_value", TypeDescription.createDouble())
                    .addField("boolean_value", TypeDescription.createBoolean())
                    .addField("string_value", TypeDescription.createString())
                    .addField("decimal_value", TypeDescription.createDecimal().withScale(18))
                    .addField("date_value", TypeDescription.createTimestamp())
                    .addField("timestamp_value", TypeDescription.createTimestamp());
            Reader reader = OrcFile.createReader(new Path(path),
                    OrcFile.readerOptions(conf));
            OrcFile.WriterVersion writerVersion = reader.getWriterVersion();
            System.out.println("writerVersion=" + writerVersion);
            Reader.Options readerOptions = new Reader.Options()
                    .searchArgument(
                            SearchArgumentFactory
                                    .newBuilder()
                                    .between("long_value", PredicateLeaf.Type.LONG, 0L, 1024L)
                                    .build(),
                            new String[]{"long_value"}
                    );
            RecordReader rows = reader.rows(readerOptions.schema(readSchema));
            VectorizedRowBatch batch = readSchema.createRowBatch();
            int count = 0;
            while (rows.nextBatch(batch)) {
                LongColumnVector longVector = (LongColumnVector) batch.cols[0];
                DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[1];
                LongColumnVector booleanVector = (LongColumnVector) batch.cols[2];
                BytesColumnVector stringVector = (BytesColumnVector) batch.cols[3];
                DecimalColumnVector decimalVector = (DecimalColumnVector) batch.cols[4];
                TimestampColumnVector dateVector = (TimestampColumnVector) batch.cols[5];
                TimestampColumnVector timestampVector = (TimestampColumnVector) batch.cols[6];
                count++;
                if (count == 1) {
                    // Only print the first batch to keep the output short.
                    for (int r = 0; r < batch.size; r++) {
                        long longValue = longVector.vector[r];
                        double doubleValue = doubleVector.vector[r];
                        boolean boolValue = booleanVector.vector[r] != 0;
                        String stringValue = stringVector.toString(r);
                        HiveDecimalWritable hiveDecimalWritable = decimalVector.vector[r];
                        long time1 = dateVector.getTime(r);
                        Date date = new Date(time1);
                        String format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date);
                        long time = timestampVector.time[r];
                        int nano = timestampVector.nanos[r];
                        Timestamp timestamp = new Timestamp(time);
                        timestamp.setNanos(nano);
                        System.out.println(longValue + ", " + doubleValue + ", " + boolValue + ", " + stringValue
                                + ", " + hiveDecimalWritable.getHiveDecimal().toFormatString(18)
                                + ", " + format + ", " + timestamp);
                    }
                }
            }
            System.out.println("count=" + count);
            rows.close();
        }
    }

    From the printed output you can see that the decimal value kept 18 digits after the decimal point, the date value is correct, and the timestamp kept its milliseconds.

    The generated file:

     ————————————————————————————————————————

    Integrating with Hive

     create table test.orc_read(
       long_value bigint,
       double_value double,
       boolean_value boolean,
       string_value string,
       decimal_value decimal(38,18),
       date_value date,
       timestamp_value timestamp
     )
     stored as orc;

     

    Put the file we just wrote into the directory backing the table, for example:
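    A minimal sketch of that step, assuming the default warehouse location and that the ORC file has already been uploaded to /tmp on HDFS; both paths are assumptions, adjust them to your environment:

    -- Option 1: copy the file straight into the table's directory (assumed default warehouse path)
    --   hdfs dfs -put my-file.orc /user/hive/warehouse/test.db/orc_read/
    -- Option 2: let Hive move a file that is already on HDFS into the table
    LOAD DATA INPATH '/tmp/my-file.orc' INTO TABLE test.orc_read;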

    Now for the moment of magic: a plain select * shows the contents of the ORC file. Everything looks OK. Actually, it is not OK; there is a problem.

    Right after that, we can go and rework DataX's HdfsWriter.

    In essence it is just a matter of replacing that piece of its code with our own writer, and our writer already has a batch whose size can be used to control how fast the file is written out; a rough sketch of the idea follows.
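    A minimal sketch of that idea, not DataX's actual plugin code: the class name BatchedOrcWriter, the two-column schema, and the output file name are all made up for illustration. The point is that TypeDescription.createRowBatch(int) lets you choose the batch size, which is the knob for how many rows are buffered before each addRowBatch call.

    package com.chenchi.learning.fileformat.orc;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    import java.io.File;
    import java.io.IOException;

    // Illustrative helper: wraps an ORC Writer so the batch size is configurable.
    public class BatchedOrcWriter implements AutoCloseable {
        private final Writer writer;
        private final VectorizedRowBatch batch;

        public BatchedOrcWriter(String path, TypeDescription schema, int batchSize) throws IOException {
            this.writer = OrcFile.createWriter(new Path(path),
                    OrcFile.writerOptions(new Configuration())
                            .setSchema(schema)
                            .compress(CompressionKind.ZLIB));
            // createRowBatch(int) sets the maximum number of rows buffered per batch.
            this.batch = schema.createRowBatch(batchSize);
        }

        // Writes one (id, name) record; flushes automatically when the batch is full.
        public void write(long id, String name) throws IOException {
            int row = batch.size++;
            ((LongColumnVector) batch.cols[0]).vector[row] = id;
            ((BytesColumnVector) batch.cols[1]).setVal(row, name.getBytes());
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }

        @Override
        public void close() throws IOException {
            if (batch.size != 0) {          // flush any remaining rows
                writer.addRowBatch(batch);
                batch.reset();
            }
            writer.close();
        }

        public static void main(String[] args) throws IOException {
            new File("batched-demo.orc").delete(); // ORC refuses to overwrite an existing file
            TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");
            try (BatchedOrcWriter out = new BatchedOrcWriter("batched-demo.orc", schema, 4096)) {
                for (long i = 0; i < 100_000; i++) {
                    out.write(i, "row-" + i);
                }
            }
        }
    }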

  • Original article: https://blog.csdn.net/cclovezbf/article/details/126643684