Hadoop (15) --------- Hadoop Data Compression

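1. Overview

Pros and cons of compression

Pros: reduces disk I/O and disk storage space.
Cons: increases CPU overhead.

Compression guidelines

Compute-intensive jobs: use compression sparingly.
I/O-intensive jobs: use compression liberally.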
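
The trade-off above can be observed outside Hadoop. The following is a minimal sketch that uses only the JDK's `java.util.zip` (not Hadoop's codec classes): it gzips an in-memory sample and reports the size before and after together with the time spent. The class name `CompressionTradeoff` and the sample data are made up for illustration, and the measured numbers will vary by machine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionTradeoff {

    // Compress a byte array with gzip, entirely in memory.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the buffer holds the full output.
        return buffer.toByteArray();
    }

    // Build a repetitive, log-like sample (hypothetical data, compresses well).
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("hadoop mapreduce compression example line ")
              .append(i % 10)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleData(100_000);

        long start = System.nanoTime();
        byte[] packed = gzip(raw);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        // Fewer bytes to store and move, paid for with CPU time.
        System.out.printf("raw=%d bytes, gzip=%d bytes, time=%d us%n",
                raw.length, packed.length, elapsedMicros);
    }
}
```

Fewer bytes means less disk and network I/O, while the elapsed time is the extra CPU cost, which is why the guideline favors compression for I/O-bound jobs.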

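2. Compression Codecs Supported by MR

Comparison of compression algorithms

[Figure: comparison table of compression algorithms]

Compression performance comparison

[Figure: compression performance comparison table]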

http://google.github.io/snappy/

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

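3. Choosing a Compression Format

When choosing a compression format, the main considerations are compression/decompression speed, compression ratio (the size of the data after compression), and whether the compressed file can still be split.

Gzip

Pros: relatively high compression ratio.
Cons: does not support splitting; average compression/decompression speed.

Bzip2

Pros: high compression ratio; supports splitting.
Cons: slow compression/decompression.

LZO

Pros: relatively fast compression/decompression; supports splitting.
Cons: average compression ratio; splitting requires building an index as an extra step.

Snappy

Pros: fast compression and decompression.
Cons: does not support splitting; average compression ratio.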
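
The selection criteria above can be written down as a small lookup. This is only a sketch: the class `CodecChooser`, the coarse `Speed` classes, and the `candidates` helper are hypothetical names introduced here, and the traits are simply the pros and cons listed above, not measurements.

```java
import java.util.ArrayList;
import java.util.List;

public class CodecChooser {

    // Coarse speed classes, ordered from slowest to fastest.
    enum Speed { SLOW, MEDIUM, FAST }

    // Traits copied from the pros and cons listed above.
    enum Codec {
        GZIP(false, Speed.MEDIUM),
        BZIP2(true, Speed.SLOW),
        LZO(true, Speed.FAST),      // splittable only once an index has been built
        SNAPPY(false, Speed.FAST);

        final boolean splittable;
        final Speed speed;

        Codec(boolean splittable, Speed speed) {
            this.splittable = splittable;
            this.speed = speed;
        }
    }

    // Return the codecs that satisfy a split requirement and a minimum speed.
    static List<Codec> candidates(boolean needSplit, Speed minSpeed) {
        List<Codec> result = new ArrayList<>();
        for (Codec c : Codec.values()) {
            if (needSplit && !c.splittable) {
                continue; // the file must be splittable, but this codec is not
            }
            if (c.speed.compareTo(minSpeed) < 0) {
                continue; // slower than the caller is willing to accept
            }
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates(true, Speed.FAST));   // prints [LZO]
        System.out.println(candidates(false, Speed.FAST));  // prints [LZO, SNAPPY]
    }
}
```

Compression ratio is left out of the sketch on purpose; in practice it is the third axis to weigh once the split and speed requirements have narrowed the list.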

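4. Choosing Where to Compress

Compression can be enabled at any stage of a MapReduce job: on the input, on the map output, or on the reduce output.

[Figure: compression points in a MapReduce job]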

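5. Compression Parameter Configuration

To support multiple compression/decompression algorithms, Hadoop introduces codecs (coder/decoders).

[Figure: table of compression formats and their codec classes]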
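
Since the original table is not available, here is a rough sketch of the idea: each compression format is tied to a codec class, and a file's extension is enough to pick one. The class names below are the codec classes shipped with Hadoop (plus `LzopCodec` from the separate hadoop-lzo project); the `CodecLookup` helper itself is hypothetical and uses plain strings so that it runs without Hadoop on the classpath. In a real job, Hadoop's `CompressionCodecFactory` performs this kind of lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class CodecLookup {

    // File extension -> fully qualified codec class name.
    static final Map<String, String> CODEC_BY_EXTENSION = new LinkedHashMap<>();
    static {
        CODEC_BY_EXTENSION.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODEC_BY_EXTENSION.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_EXTENSION.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_EXTENSION.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODEC_BY_EXTENSION.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Find the codec class for a file name, or empty if the file looks uncompressed.
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_EXTENSION.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(codecFor("part-r-00000.bz2").orElse("no codec"));
        System.out.println(codecFor("part-r-00000").orElse("no codec"));
    }
}
```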

To enable compression in Hadoop, configure the following parameters.

[Figure: table of compression parameters]
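
The original parameter table is not available, so the sketch below lists only the MapReduce properties that the driver code in the following sections touches, with the default values from Hadoop's `mapred-default.xml`. It may not match the original table entry for entry. The class `CompressionProperties` and its helpers are hypothetical; they only print `<property>` entries in the form used by `mapred-site.xml` and do not talk to Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CompressionProperties {

    // Property names with their defaults from mapred-default.xml.
    static Map<String, String> defaults() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce.map.output.compress", "false");
        props.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.DefaultCodec");
        props.put("mapreduce.output.fileoutputformat.compress", "false");
        props.put("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.DefaultCodec");
        return props;
    }

    // Render the properties as <property> entries for a Hadoop XML config file.
    static String toXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("<property>\n")
              .append("  <name>").append(e.getKey()).append("</name>\n")
              .append("  <value>").append(e.getValue()).append("</value>\n")
              .append("</property>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = defaults();
        // Example override: compress the map output with Snappy.
        props.put("mapreduce.map.output.compress", "true");
        props.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        System.out.print(toXml(props));
    }
}
```

The same property names can also be set per job in the driver, which is the approach the next two sections take.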

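6. Compressing Map Output

Even if the input and output files of your MapReduce job are uncompressed, you can still compress the intermediate output of the map tasks. That output is written to disk and transferred over the network to the reduce nodes, so compressing it can improve performance considerably, and it takes only two properties.

The Mapper and Reducer code stays the same; only the Driver changes.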

package com.fancy.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		Configuration conf = new Configuration();

		// Enable compression of the map output
		conf.setBoolean("mapreduce.map.output.compress", true);

		// Set the codec used for the map output
		conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

		Job job = Job.getInstance(conf);

		job.setJarByClass(WordCountDriver.class);

		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		boolean result = job.waitForCompletion(true);

		System.exit(result ? 0 : 1);
	}
}

7. Compressing Reduce Output

Again, only the driver needs to change.

package com.fancy.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		
		Configuration conf = new Configuration();
		
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(WordCountDriver.class);
		
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// Enable compression of the reduce output
		FileOutputFormat.setCompressOutput(job, true);

		// Set the codec; the commented-out lines show alternatives
		FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
//		FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
//		FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);

		boolean result = job.waitForCompletion(true);

		System.exit(result ? 0 : 1);
	}
}

Original article: https://blog.csdn.net/m0_51111980/article/details/125869153
