Sqoop (4) --------- Configuration Options in Brief

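I. Importing Directly into HDFS

A. Full-table import (or partial import)

bin/sqoop import \

## JDBC URL, username, and password of the relational database
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \

## Source table
--table t_emp \

## HDFS path where the imported data is stored
--target-dir /sqoopTest \

## Delete the target path first if it already exists
--delete-target-dir \

## Field delimiter used in the files written to HDFS
--fields-terminated-by "\t" \

## Number of MapTasks to start (default: 4)
--num-mappers 2 \

## Column used to split the data set; each MapTask then handles one slice
--split-by id \

## For a partial import, add the option below to list the columns to import

## Separate multiple columns with commas, with no spaces around them
--columns id,name,age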
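
As a rough illustration of how --split-by and --num-mappers work together: Sqoop looks up the minimum and maximum of the split column and gives each MapTask one slice of that range. The sketch below is a simplified stand-in written for this article, not Sqoop's own splitter. The function name `split_ranges` and the sample bounds 1 and 10 are assumptions.

```shell
# Simplified sketch: divide the id range [min, max] into one slice per mapper.
# split_ranges is a hypothetical helper, not a Sqoop command.
split_ranges() {
  # $1 = MIN(id), $2 = MAX(id), $3 = number of mappers
  min=$1; max=$2; n=$3
  size=$(( (max - min) / n ))   # base width of each slice
  rem=$(( (max - min) % n ))    # the first "rem" slices get one extra value
  lo=$min
  i=0
  while [ "$i" -lt "$n" ]; do
    hi=$(( lo + size ))
    if [ "$i" -lt "$rem" ]; then
      hi=$(( hi + 1 ))
    fi
    if [ "$i" -eq $(( n - 1 )) ]; then
      # The last slice is closed so that MAX(id) itself is included.
      echo "id >= $lo AND id <= $max"
    else
      echo "id >= $lo AND id < $hi"
    fi
    lo=$hi
    i=$(( i + 1 ))
  done
}

# Assumed bounds: ids run from 1 to 10, imported with --num-mappers 2.
split_ranges 1 10 2
```

Each printed line is the kind of range condition one MapTask would receive. If the values of the split column are not spread evenly over the range, the slices hold very different row counts and the tasks become unbalanced.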

B. Filtered import with the --where option

bin/sqoop import \

--connect jdbc:mysql://hadoop102:3306/test \

--username root \

--password 123 \

--table t_emp \

## WHERE condition used to filter rows; wrap it in quotes so the shell does not treat > as a redirection

--where 'id>6' \

--target-dir /sqoopTest \

--delete-target-dir \

--fields-terminated-by "\t" \

--num-mappers 1 \

--split-by id

C. Import with a free-form query

bin/sqoop import \

--connect jdbc:mysql://hadoop102:3306/test \

--username root \

--password 123 \

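## Prefer single quotes around the query

## If the query is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS so the shell does not expand it as one of its own variables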

--query 'select * from t_emp where id>3 and $CONDITIONS' \

--target-dir /sqoopTest \

--delete-target-dir \

--fields-terminated-by "\t" \

--num-mappers 1 \

--split-by id
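
The quoting rule for $CONDITIONS can be checked in any shell without running Sqoop. The sketch below only assigns strings and prints them. The variable names `single`, `escaped`, and `broken` are made up for the demonstration.

```shell
# Simulate the usual situation: the shell has no meaningful CONDITIONS variable.
CONDITIONS=''

# Single quotes: the shell passes $CONDITIONS through untouched.
single='select * from t_emp where id>3 and $CONDITIONS'

# Double quotes with a backslash: \$ also yields a literal $CONDITIONS.
escaped="select * from t_emp where id>3 and \$CONDITIONS"

# Double quotes without the backslash: the shell expands the empty variable,
# so the placeholder is gone before Sqoop ever sees the query.
broken="select * from t_emp where id>3 and $CONDITIONS"

printf 'single : %s\n' "$single"
printf 'escaped: %s\n' "$escaped"
printf 'broken : %s\n' "$broken"
```

The first two strings are identical and still contain the placeholder. The third ends right after "and", which is the form that makes the import fail.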

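Notes:

1. When --query is used, --table, --columns, and --where should not be specified alongside it. The three behave differently if you do:

--query and --table must never be used together!

When --where and --query are both present, --where is ignored.

When --columns and --query are both present, --columns still takes effect!

2. --query must be accompanied by --target-dir.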
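
These rules can be checked before a job is submitted. The helper below is a hypothetical pre-flight check that encodes only the notes in this section. It is not part of Sqoop and does not reproduce Sqoop's own validation.

```shell
# check_import_args is a hypothetical helper: it inspects an argument list
# and reports the option combinations that the notes warn about.
check_import_args() {
  has_query=0; has_table=0; has_target=0; has_where=0
  for arg in "$@"; do
    case "$arg" in
      --query)      has_query=1 ;;
      --table)      has_table=1 ;;
      --target-dir) has_target=1 ;;
      --where)      has_where=1 ;;
    esac
  done
  if [ "$has_query" -eq 1 ] && [ "$has_table" -eq 1 ]; then
    echo "error: --query and --table cannot be combined"
    return 1
  fi
  if [ "$has_query" -eq 1 ] && [ "$has_target" -eq 0 ]; then
    echo "error: --query requires --target-dir"
    return 1
  fi
  if [ "$has_query" -eq 1 ] && [ "$has_where" -eq 1 ]; then
    echo "warning: --where is ignored when --query is present"
  fi
  echo "ok"
}

# A valid combination.
check_import_args --query 'select * from t_emp where $CONDITIONS' --target-dir /sqoopTest

# An invalid combination; "|| true" keeps the script going after the error.
check_import_args --query 'select * from t_emp where $CONDITIONS' --table t_emp --target-dir /sqoopTest || true
```

Passing the same arguments you intend to give to bin/sqoop import lets the check run as a cheap guard at the top of a wrapper script.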

II. Importing into Hive

bin/sqoop import \

--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123 \

--query 'select * from t_emp where id>3 and $CONDITIONS' \

--target-dir /sqoopTest \

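## If no delimiter is set, Sqoop falls back to Hive's default field delimiter (Ctrl-A, \001), which is non-printable and awkward to work with later, so setting one explicitly is recommended
--fields-terminated-by "\t" \
--delete-target-dir \

## Import into Hive
--hive-import \

## Overwrite the existing table data; without this option the data is appended
--hive-overwrite \

## Name of the Hive table to import into
--hive-table t_emp \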

--num-mappers 1 \

--split-by id

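Under the hood this still runs in two steps: the data is first imported from the relational database into HDFS (under --target-dir) and then loaded from there into Hive. The load moves the files into the Hive table's directory, so they no longer remain under --target-dir afterwards.

Note: if the table does not exist in Hive, it is created automatically, but its column types are generated by Sqoop, so creating the table by hand beforehand is still recommended.

You can also perform the two steps yourself:

First, import into HDFS: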

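#!/bin/bash

# $sqoop (the sqoop executable) and $do_date (the date used in the target path) must be set before import_data is called.
# Usage: import_data <table_name> <query>
# The query must already contain a WHERE clause, because " and $CONDITIONS" is appended to it.
import_data(){
# --compress, --compression-codec lzop: compress the output, using lzop as the codec
# --null-string, --null-non-string: write NULLs of string and non-string columns as \N so that Hive can read them
$sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 123 \
--target-dir /origin_data/gmall/db/$1/$do_date \
--delete-target-dir \
--query "$2 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop \
--null-string '\\N' \
--null-non-string '\\N'
}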

Then use the LOAD DATA statement to load the files into Hive.
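
For this second step, a LOAD DATA INPATH statement moves the imported files into a Hive table. The source does not show the statement it uses, so the sketch below only builds one as a string. The table name `order_info`, the `ods_` prefix, the `dt` partition column, and the date are hypothetical values chosen for illustration.

```shell
# build_load_statement is a hypothetical helper that prints a HiveQL statement.
# The HDFS path follows the --target-dir layout used by import_data.
build_load_statement() {
  # $1 = table name, $2 = date (the same values used for the import)
  printf "LOAD DATA INPATH '/origin_data/gmall/db/%s/%s' OVERWRITE INTO TABLE ods_%s PARTITION (dt='%s');\n" \
    "$1" "$2" "$1" "$2"
}

stmt=$(build_load_statement order_info 2020-06-14)
printf '%s\n' "$stmt"

# To execute it for real (requires a Hive installation and an existing table):
#   hive -e "$stmt"
```

Keeping the path construction in one place makes it harder for the import directory and the load path to drift apart.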

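Note: this script handles NULL values explicitly. Hive stores NULL as "\N" in its underlying files, whereas MySQL stores NULL as a true NULL. To keep the data consistent on both sides, use --input-null-string and --input-null-non-string when exporting, and --null-string and --null-non-string when importing.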
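
The export-side options are named here but not shown in a command, so here is a hedged sketch. It borrows the connection details from the export example later in this article and, by default, only prints the command instead of running it. The `run` wrapper and the `DRY_RUN` variable are hypothetical helpers.

```shell
# Dry run by default: print the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# '\\N' reaches Sqoop as backslash, backslash, N; the Sqoop documentation
# uses this spelling to make \N the NULL marker.
export_cmd=$(run bin/sqoop export \
  --connect 'jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=utf-8' \
  --username root \
  --password 123 \
  --table t_emp2 \
  --num-mappers 1 \
  --export-dir /user/hive/warehouse/t_emp \
  --input-fields-terminated-by '\t' \
  --input-null-string '\\N' \
  --input-null-non-string '\\N')

printf '%s\n' "$export_cmd"
```

With these two options, a \N in the HDFS files is written to MySQL as NULL rather than as the literal two-character text.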

III. Importing into HBase

bin/sqoop import \

--connect jdbc:mysql://hadoop102:3306/test \

--username root \

--password 123 \

--query 'select * from t_emp where id>3 and $CONDITIONS' \

--target-dir /sqoopTest \

--delete-target-dir \

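## Create the HBase table if it does not exist

--hbase-create-table \

## Name of the table in HBase

--hbase-table "t_emp" \

## Column of the imported data to use as the row key

--hbase-row-key "id" \

## Column family to import into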

--column-family "info" \

--num-mappers 2 \

--split-by id

To import into multiple column families, you have to run the command several times, one column family per run.
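
A minimal sketch of that repetition, assuming the columns are divided between two column families. The family `job`, the way the columns are split, and the `SQOOP` variable are assumptions for illustration. By default the function only echoes the command.

```shell
# Dry run by default: the command line is echoed, not executed.
# Set SQOOP=bin/sqoop to run the imports for real.
SQOOP="${SQOOP:-echo bin/sqoop}"

import_family() {
  # $1 = HBase column family, $2 = comma-separated source columns for that family
  # The row key column (id) is selected on every run.
  $SQOOP import \
    --connect jdbc:mysql://hadoop102:3306/test \
    --username root \
    --password 123 \
    --query "select id,$2 from t_emp where \$CONDITIONS" \
    --target-dir /sqoopTest \
    --delete-target-dir \
    --hbase-create-table \
    --hbase-table t_emp \
    --hbase-row-key id \
    --column-family "$1" \
    --num-mappers 1
}

# One run per column family (hypothetical column assignment).
import_family info name,age
import_family job deptid,empno
```

Each call differs only in the selected columns and the --column-family value, so both runs write to the same row keys.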

IV. Export

Export data from HDFS into a relational database.

1. When the target SQL table is empty

bin/sqoop export \

--connect 'jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=utf-8' \

--username root \

--password 123 \

## Name of the target table; it must be created in advance

--table t_emp2 \

--num-mappers 1 \

## HDFS path of the data to export

--export-dir /user/hive/warehouse/t_emp \

## Field delimiter of the data on HDFS

--input-fields-terminated-by "\t"

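2. When the target table is not empty

If the primary key of an exported row collides with a row already in the table, the export fails with an error like:

Duplicate entry '5' for key 'PRIMARY'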

In plain SQL, you can use:

INSERT INTO t_emp2 VALUE(5,'jack',30,3,1111)
ON DUPLICATE KEY UPDATE NAME=VALUES(NAME),deptid=VALUES(deptid),
empno=VALUES(empno);

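This means that when an insert hits a duplicate primary key, the existing row is updated instead of a new row being inserted.

With Sqoop, you can enable one of the following two modes:

① updateonly mode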

bin/sqoop export \

--connect 'jdbc:mysql://hadoop103:3306/mydb?useUnicode=true&characterEncoding=utf-8' \

--username root \

--password 123456 \

--table t_emp2 \

--num-mappers 1 \

--export-dir /hive/t_emp \

--input-fields-terminated-by "\t" \

--update-key id

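With --update-key alone (updateonly is the default --update-mode), rows whose key already exists in the table are updated, but rows whose key does not exist are not inserted.

② allowinsert mode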

bin/sqoop export \

--connect 'jdbc:mysql://hadoop103:3306/mydb?useUnicode=true&characterEncoding=utf-8' \

--username root \

--password 123456 \

--table t_emp2 \

--num-mappers 1 \

--export-dir /hive/t_emp \

--input-fields-terminated-by "\t" \

--update-key id \

--update-mode allowinsert

In this mode, rows whose key already exists are updated, and rows whose key does not exist are inserted as well.
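
The difference between the two modes can be reproduced on plain text files. The sketch below is only a simulation with awk, not how Sqoop is implemented. The three-column rows (id, name, age) and the helper `simulate_export` are simplified assumptions.

```shell
workdir=$(mktemp -d)

# Rows already in the target table t_emp2 (id, name, age).
printf '5\tjack\t30\n6\trose\t25\n' > "$workdir/table.tsv"
# Rows found under --export-dir: id 5 already exists, id 7 is new.
printf '5\tjack\t31\n7\ttom\t40\n' > "$workdir/export.tsv"

simulate_export() {
  # $1 = updateonly | allowinsert
  awk -F '\t' -v mode="$1" '
    NR == FNR { incoming[$1] = $0; next }                        # first file: exported rows, keyed by id
    ($1 in incoming) { print incoming[$1]; seen[$1] = 1; next }  # key matches: row is updated
    { print }                                                    # no exported row for this key: unchanged
    END {
      if (mode == "allowinsert")
        for (k in incoming)
          if (!(k in seen)) print incoming[k]                    # key not in table: row is inserted
    }
  ' "$workdir/export.tsv" "$workdir/table.tsv" | sort -n
}

updateonly_result=$(simulate_export updateonly)
allowinsert_result=$(simulate_export allowinsert)

printf 'updateonly:\n%s\n' "$updateonly_result"
printf 'allowinsert:\n%s\n' "$allowinsert_result"

rm -rf "$workdir"
```

In both modes the row with id 5 ends up with the new age. The row with id 7 reaches the table only under allowinsert.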

3. How to see what the export command actually executes

Configure /etc/my.cnf
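
The article does not list the settings it adds to /etc/my.cnf. One plausible reading, offered here purely as an assumption, is that MySQL's general query log is switched on so that the statements issued by the export show up in a log file. The sketch writes such a fragment to a scratch file rather than touching /etc/my.cnf. The log path is hypothetical.

```shell
# Write the fragment to a scratch file for review; merge it into /etc/my.cnf by hand.
fragment=$(mktemp)

cat > "$fragment" <<'EOF'
[mysqld]
# Log every statement the server receives (verbose; switch it off after debugging).
general_log = 1
# Hypothetical location; choose a path the mysqld user can write to.
general_log_file = /var/lib/mysql/general.log
EOF

cat "$fragment"
```

After the fragment is merged and MySQL is restarted, running the export command that follows and reading the log file should show the statements Sqoop sends to the database.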

bin/sqoop export \

--connect 'jdbc:mysql://hadoop103:3306/mydb?useUnicode=true&characterEncoding=utf-8' \

--username root \

--password 123456 \

--table t_emp2 \

--num-mappers 1 \

--export-dir /hive/t_emp \

--input-fields-terminated-by "\t" \

--update-key id \

--update-mode allowinsert

Original article: https://blog.csdn.net/m0_51111980/article/details/127730116