Hive系列
窗口分析函数:窗口函数也称为OLAP(Online Analytical Processing)函数,是对一组值进行操作,不需要使用Group by子句对数据进行分组,还能在同一行返回原来行的列和使用聚合函数得到的聚合列。
官网地址: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
数据准备:cookie.txt
cookie1,2021-04-10,1
cookie1,2021-04-11,5
cookie1,2021-04-12,7
cookie1,2021-04-13,3
cookie1,2021-04-14,2
cookie1,2021-04-15,4
cookie1,2021-04-16,4
建表准备:
create database if not exists myhive;
use myhive;
drop table if exists cookie;
create table cookie(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';
load data local inpath "/home/data/cookie.txt" into table cookie;
select * from cookie;
语句:
注:在SQL中加上 – 表示注释,不影响语句执行。
SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING
AND CURRENT ROW) AS pv4, --当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING
AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行
FROM cookie order by cookieid, createtime;
上面给注释去掉等同于下面,都可以直接运行:
SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
SUM(pv) OVER(PARTITION BY cookieid) AS pv3,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6
FROM cookie order by cookieid, createtime;
结果如下:
Total MapReduce CPU Time Spent: 0 msec
OK
cookie1 2021-04-10 1 1 1 26 1 6 26
cookie1 2021-04-11 5 6 6 26 6 13 25
cookie1 2021-04-12 7 13 13 26 13 16 20
cookie1 2021-04-13 3 16 16 26 16 18 13
cookie1 2021-04-14 2 18 18 26 17 21 10
cookie1 2021-04-15 4 22 22 26 16 20 8
cookie1 2021-04-16 4 26 26 26 13 13 4
Time taken: 5.511 seconds, Fetched: 7 row(s)
解释:
pv1: 分组内从起点到当前行的pv累积,如,11号的pv1 = 10号的pv+11号的pv, 12号 = 10号+11号+12号
pv2: 同pv1
pv3: 分组内(cookie1)所有的pv累加
pv4: 分组内当前行+往前3行,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10
扩展:
如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
AVG,MIN,MAX,和SUM用法一样。只需要把sum函数,改成avg,min,max,sum等就可以。
语句:
SELECT cookieid, createtime, pv,
round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime), 2) AS pv1, --默认为从起点到当前行
round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW), 2) AS pv2, --从起点到当前行,结果同pv1
round(AVG(pv) OVER(PARTITION BY cookieid), 2) AS pv3, --分组内所有行
round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3
PRECEDING AND CURRENT ROW), 2) AS pv4, --当前行+往前3行
round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3
PRECEDING AND 1 FOLLOWING), 2) AS pv5, --当前行+往前3行+往后1行
round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN
CURRENT ROW AND UNBOUNDED FOLLOWING), 2) AS pv6 --当前行+往后所有行
FROM cookie;
结果如下:
Total MapReduce CPU Time Spent: 0 msec
OK
cookie1 2021-04-16 4 3.71 3.71 3.71 3.25 3.25 4.0
cookie1 2021-04-15 4 3.67 3.67 3.71 4.0 4.0 4.0
cookie1 2021-04-14 2 3.6 3.6 3.71 4.25 4.2 3.33
cookie1 2021-04-13 3 4.0 4.0 3.71 4.0 3.6 3.25
cookie1 2021-04-12 7 4.33 4.33 3.71 4.33 4.0 4.0
cookie1 2021-04-11 5 3.0 3.0 3.71 3.0 4.33 4.17
cookie1 2021-04-10 1 1.0 1.0 3.71 1.0 3.0 3.71
Time taken: 2.911 seconds, Fetched: 7 row(s)
准备数据:cookie2.txt
cookie1,2021-04-10,1
cookie1,2021-04-11,5
cookie1,2021-04-12,7
cookie1,2021-04-13,3
cookie1,2021-04-14,2
cookie1,2021-04-15,4
cookie1,2021-04-16,4
cookie2,2021-04-10,2
cookie2,2021-04-11,3
cookie2,2021-04-12,5
cookie2,2021-04-13,6
cookie2,2021-04-14,3
cookie2,2021-04-15,9
cookie2,2021-04-16,7
建表导入数据相关操作:
create database if not exists myhive;
use myhive;
drop table if exists cookie2;
create table cookie2(cookieid string, createtime string, pv int) row format
delimited fields terminated by ',';
load data local inpath "/home/data/cookie2.txt" into table cookie2;
select * from cookie2;
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均匀,默认增加第一个切片的分布
SQL语句实例:
SELECT cookieid,createtime,pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, --分组内将数据分成2片
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, --分组内将数据分成3片
NTILE(4) OVER(ORDER BY createtime) AS rn3 --将所有数据分成4片
FROM cookie2 ORDER BY cookieid,createtime;
结果如下:
Total MapReduce CPU Time Spent: 0 msec
OK
cookie1 2021-04-10 1 1 1 1
cookie1 2021-04-11 5 1 1 1
cookie1 2021-04-12 7 1 1 2
cookie1 2021-04-13 3 1 2 2
cookie1 2021-04-14 2 2 2 3
cookie1 2021-04-15 4 2 3 4
cookie1 2021-04-16 4 2 3 4
cookie2 2021-04-10 2 1 1 1
cookie2 2021-04-11 3 1 1 1
cookie2 2021-04-12 5 1 1 2
cookie2 2021-04-13 6 1 2 2
cookie2 2021-04-14 3 2 2 3
cookie2 2021-04-15 9 2 3 3
cookie2 2021-04-16 7 2 3 4
Time taken: 4.193 seconds, Fetched: 14 row(s)
比如,统计一个cookie,pv数最多的前1/3的天
SELECT cookieid, createtime, pv,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM cookie2;
结果:
Stage-Stage-1: HDFS Read: 2646 HDFS Write: 882 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
cookie1 2021-04-12 7 1
cookie1 2021-04-11 5 1
cookie1 2021-04-16 4 1
cookie1 2021-04-15 4 2
cookie1 2021-04-13 3 2
cookie1 2021-04-14 2 3
cookie1 2021-04-10 1 3
cookie2 2021-04-15 9 1
cookie2 2021-04-16 7 1
cookie2 2021-04-13 6 1
cookie2 2021-04-12 5 2
cookie2 2021-04-11 3 2
cookie2 2021-04-14 3 3
cookie2 2021-04-10 2 3
Time taken: 1.349 seconds, Fetched: 14 row(s)
其中:rn = 1 的记录,就是我们想要的结果
ROW_NUMBER() – 从1开始,按照顺序,生成分组内记录的序列 –比如,按照pv降序排列,生成分组内
每天的pv名次,ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录;获取一个
session中的第一条refer等。
// 分组排序
SELECT cookieid, createtime, pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM cookie2;
结果:
cookie1 2021-04-12 7 1
cookie1 2021-04-11 5 2
cookie1 2021-04-16 4 3
cookie1 2021-04-15 4 4
cookie1 2021-04-13 3 5
cookie1 2021-04-14 2 6
cookie1 2021-04-10 1 7
cookie2 2021-04-15 9 1
cookie2 2021-04-16 7 2
cookie2 2021-04-13 6 3
cookie2 2021-04-12 5 4
cookie2 2021-04-11 3 5
cookie2 2021-04-14 3 6
cookie2 2021-04-10 2 7
所以如果需要取每一组的前3名,只需要rn<=3即可
RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位
SQL语句实例:
SELECT cookieid, createtime, pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM cookie2
WHERE cookieid = 'cookie1';
结果如下:
cookie1 2021-04-12 7 1 1 1
cookie1 2021-04-11 5 2 2 2
cookie1 2021-04-16 4 3 3 3
cookie1 2021-04-15 4 3 3 4
cookie1 2021-04-13 3 5 4 5
cookie1 2021-04-14 2 6 5 6
cookie1 2021-04-10 1 7 6 7
Time taken: 1.411 seconds, Fetched: 7 row(s)
三者对比总结:
row_number 按顺序编号,不留空位
rank 按顺序编号,相同的值编相同号,留空位
dense_rank 按顺序编号,相同的值编相同的号,不留空位
数据准备:cookie3.txt
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
建表导入数据相关操作:
create database if not exists myhive;
use myhive;
drop table if exists cookie3;
create table cookie3(dept string, userid string, sal int) row format delimited
fields terminated by ',';
load data local inpath "/home/data/cookie3.txt" into table cookie3;
select * from cookie3;
–CUME_DIST 小于等于当前值的行数/分组内总行数
比如,统计小于等于当前薪水的人数,所占总人数的比例
语句:
SELECT dept, userid, sal,
round(CUME_DIST() OVER(ORDER BY sal), 2) AS rn1,
round(CUME_DIST() OVER(PARTITION BY dept ORDER BY sal), 2) AS rn2
FROM cookie3;
结果如下:
OK
d1 user1 1000 0.2 0.33
d1 user2 2000 0.4 0.67
d1 user3 3000 0.6 1.0
d2 user4 4000 0.8 0.5
d2 user5 5000 1.0 1.0
Time taken: 2.736 seconds, Fetched: 5 row(s)
PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
SQL语句实例1:
SELECT dept, userid, sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1, --分组内
RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值
SUM(1) OVER(PARTITION BY NULL) AS rn12, --分组内总行数
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM cookie3;
结果:
OK
d1 user1 1000 0.0 1 5 0.0
d1 user2 2000 0.25 2 5 0.5
d1 user3 3000 0.5 3 5 1.0
d2 user4 4000 0.75 4 5 0.0
d2 user5 5000 1.0 5 5 1.0
Time taken: 3.719 seconds, Fetched: 5 row(s)
SQL语句实例2:
SELECT dept, userid, sal,
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn1, --分组内
RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值
SUM(1) OVER(PARTITION BY NULL) AS rn12, --分组内总行数
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM cookie3;
结果:
Total MapReduce CPU Time Spent: 0 msec
OK
d2 user5 5000 1.0 5 5 1.0
d2 user4 4000 0.0 4 5 0.0
d1 user3 3000 1.0 3 5 1.0
d1 user2 2000 0.5 2 5 0.5
d1 user1 1000 0.0 1 5 0.0
Time taken: 3.814 seconds, Fetched: 5 row(s)
数据准备:cookie4.txt
cookie1,2021-04-10 10:00:02,url2
cookie1,2021-04-10 10:00:00,url1
cookie1,2021-04-10 10:03:04,1url3
cookie1,2021-04-10 10:50:05,url6
cookie1,2021-04-10 11:00:00,url7
cookie1,2021-04-10 10:10:00,url4
cookie1,2021-04-10 10:50:01,url5
cookie2,2021-04-10 10:00:02,url22
cookie2,2021-04-10 10:00:00,url11
cookie2,2021-04-10 10:03:04,1url33
cookie2,2021-04-10 10:50:05,url66
cookie2,2021-04-10 11:00:00,url77
cookie2,2021-04-10 10:10:00,url44
cookie2,2021-04-10 10:50:01,url55
建表导入数据相关操作:
create database if not exists myhive;
use myhive;
drop table if exists cookie4;
create table cookie4(cookieid string, createtime string, url string) row format
delimited fields terminated by ',';
load data local inpath "/home/data/cookie4.txt" into table cookie4;
select * from cookie4;
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,
第二个参数为往上第n行(可选,默认为1),
第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SQL语句实例:
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY
createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM cookie4;
结果数据:
Total MapReduce CPU Time Spent: 0 msec
OK
cookie1 2021-04-10 10:00:00 url1 1 1970-01-01 00:00:00 NULL
cookie1 2021-04-10 10:00:02 url2 2 2021-04-10 10:00:00 NULL
cookie1 2021-04-10 10:03:04 1url3 3 2021-04-10 10:00:02 2021-04-10 10:00:00
cookie1 2021-04-10 10:10:00 url4 4 2021-04-10 10:03:04 2021-04-10 10:00:02
cookie1 2021-04-10 10:50:01 url5 5 2021-04-10 10:10:00 2021-04-10 10:03:04
cookie1 2021-04-10 10:50:05 url6 6 2021-04-10 10:50:01 2021-04-10 10:10:00
cookie1 2021-04-10 11:00:00 url7 7 2021-04-10 10:50:05 2021-04-10 10:50:01
cookie2 2021-04-10 10:00:00 url11 1 1970-01-01 00:00:00 NULL
cookie2 2021-04-10 10:00:02 url22 2 2021-04-10 10:00:00 NULL
cookie2 2021-04-10 10:03:04 1url33 3 2021-04-10 10:00:02 2021-04-10 10:00:00
cookie2 2021-04-10 10:10:00 url44 4 2021-04-10 10:03:04 2021-04-10 10:00:02
cookie2 2021-04-10 10:50:01 url55 5 2021-04-10 10:10:00 2021-04-10 10:03:04
cookie2 2021-04-10 10:50:05 url66 6 2021-04-10 10:50:01 2021-04-10 10:10:00
cookie2 2021-04-10 11:00:00 url77 7 2021-04-10 10:50:05 2021-04-10 10:50:01
Time taken: 1.315 seconds, Fetched: 14 row(s)
解释:
last_1_time: 指定了往上第1行的值,default为'1970-01-01 00:00:00'
cookie1第一行,往上1行为NULL,因此取默认值 1970-01-01 00:00:00
cookie1第三行,往上1行值为第二行值, 2021-04-10 10:00:02
cookie1第六行,往上1行值为第五行值, 2021-04-10 10:50:01
last_2_time: 指定了往上第2行的值,为指定默认值
cookie1第一行,往上2行为NULL
cookie1第二行,往上2行为NULL
cookie1第四行,往上2行为第二行值, 2021-04-10 10:00:02
cookie1第七行,往上2行为第五行值, 2021-04-10 10:50:01
与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,
第二个参数为往下第n行(可选,默认为1),
第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SQL语句实例:
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY
createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS
next_2_time
FROM cookie4;
结果:
Total MapReduce CPU Time Spent: 0 msec
OK
cookie1 2021-04-10 10:00:00 url1 1 2021-04-10 10:00:02 2021-04-10 10:03:04
cookie1 2021-04-10 10:00:02 url2 2 2021-04-10 10:03:04 2021-04-10 10:10:00
cookie1 2021-04-10 10:03:04 1url3 3 2021-04-10 10:10:00 2021-04-10 10:50:01
cookie1 2021-04-10 10:10:00 url4 4 2021-04-10 10:50:01 2021-04-10 10:50:05
cookie1 2021-04-10 10:50:01 url5 5 2021-04-10 10:50:05 2021-04-10 11:00:00
cookie1 2021-04-10 10:50:05 url6 6 2021-04-10 11:00:00 NULL
cookie1 2021-04-10 11:00:00 url7 7 1970-01-01 00:00:00 NULL
cookie2 2021-04-10 10:00:00 url11 1 2021-04-10 10:00:02 2021-04-10 10:03:04
cookie2 2021-04-10 10:00:02 url22 2 2021-04-10 10:03:04 2021-04-10 10:10:00
cookie2 2021-04-10 10:03:04 1url33 3 2021-04-10 10:10:00 2021-04-10 10:50:01
cookie2 2021-04-10 10:10:00 url44 4 2021-04-10 10:50:01 2021-04-10 10:50:05
cookie2 2021-04-10 10:50:01 url55 5 2021-04-10 10:50:05 2021-04-10 11:00:00
cookie2 2021-04-10 10:50:05 url66 6 2021-04-10 11:00:00 NULL
cookie2 2021-04-10 11:00:00 url77 7 1970-01-01 00:00:00 NULL
Time taken: 1.304 seconds, Fetched: 14 row(s)
取分组内排序后,截止到当前行,第一个值
SQL语句实例:
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM cookie4;
结果:
OK
cookie1 2021-04-10 10:00:00 url1 1 url1
cookie1 2021-04-10 10:00:02 url2 2 url1
cookie1 2021-04-10 10:03:04 1url3 3 url1
cookie1 2021-04-10 10:10:00 url4 4 url1
cookie1 2021-04-10 10:50:01 url5 5 url1
cookie1 2021-04-10 10:50:05 url6 6 url1
cookie1 2021-04-10 11:00:00 url7 7 url1
cookie2 2021-04-10 10:00:00 url11 1 url11
cookie2 2021-04-10 10:00:02 url22 2 url11
cookie2 2021-04-10 10:03:04 1url33 3 url11
cookie2 2021-04-10 10:10:00 url44 4 url11
cookie2 2021-04-10 10:50:01 url55 5 url11
cookie2 2021-04-10 10:50:05 url66 6 url11
cookie2 2021-04-10 11:00:00 url77 7 url11
Time taken: 1.276 seconds, Fetched: 14 row(s)
取分组内排序后,截止到当前行,最后一个值
SQL语句实例:
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM cookie4;
结果:
OK
cookie1 2021-04-10 10:00:00 url1 1 url1
cookie1 2021-04-10 10:00:02 url2 2 url2
cookie1 2021-04-10 10:03:04 1url3 3 1url3
cookie1 2021-04-10 10:10:00 url4 4 url4
cookie1 2021-04-10 10:50:01 url5 5 url5
cookie1 2021-04-10 10:50:05 url6 6 url6
cookie1 2021-04-10 11:00:00 url7 7 url7
cookie2 2021-04-10 10:00:00 url11 1 url11
cookie2 2021-04-10 10:00:02 url22 2 url22
cookie2 2021-04-10 10:03:04 1url33 3 1url33
cookie2 2021-04-10 10:10:00 url44 4 url44
cookie2 2021-04-10 10:50:01 url55 5 url55
cookie2 2021-04-10 10:50:05 url66 6 url66
cookie2 2021-04-10 11:00:00 url77 7 url77
Time taken: 1.273 seconds, Fetched: 14 row(s)
声明:
文章中代码及相关语句为自己根据相应理解编写,文章中出现的相关图片为自己实践中的截图和相关技术对应的图片,若有相关异议,请联系删除。感谢。转载请注明出处,感谢。
B站: https://space.bilibili.com/1523287361 点击打开链接
微博地址: http://weibo.com/luoyepiaoxue2014 点击打开链接