• 浅析Hive窗口分析函数



    Hive系列


    Hive窗口分析函数全解

    零. 概念

    ​ 窗口分析函数:窗口函数也称为OLAP(Online Analytical Processing)函数,是对一组值进行操作,不需要使用Group by子句对数据进行分组,还能在同一行返回原来行的列和使用聚合函数得到的聚合列。

    官网地址: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

    一. sum, avg, max, min

    数据准备:cookie.txt

    cookie1,2021-04-10,1
    cookie1,2021-04-11,5
    cookie1,2021-04-12,7
    cookie1,2021-04-13,3
    cookie1,2021-04-14,2
    cookie1,2021-04-15,4
    cookie1,2021-04-16,4
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    建表准备:

    create database if not exists myhive;
    use myhive;
    drop table if exists cookie;
    create table cookie(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';
    load data local inpath "/home/data/cookie.txt" into table cookie;
    select * from cookie;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    1.1. sum

    语句:

    注:在SQL中加上 – 表示注释,不影响语句执行。

    SELECT cookieid,createtime,pv,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED
    PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
    SUM(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING
    AND CURRENT ROW) AS pv4, --当前行+往前3行
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING
    AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW
    AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行
    FROM cookie order by cookieid, createtime;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    上面给注释去掉等同于下面,都可以直接运行:

    SELECT cookieid,createtime,pv,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,
    SUM(pv) OVER(PARTITION BY cookieid) AS pv3,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,
    SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 
    FROM cookie order by cookieid, createtime;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    结果如下:

    Total MapReduce CPU Time Spent: 0 msec
    OK
    cookie1	2021-04-10	1	1	1	26	1	6	26
    cookie1	2021-04-11	5	6	6	26	6	13	25
    cookie1	2021-04-12	7	13	13	26	13	16	20
    cookie1	2021-04-13	3	16	16	26	16	18	13
    cookie1	2021-04-14	2	18	18	26	17	21	10
    cookie1	2021-04-15	4	22	22	26	16	20	8
    cookie1	2021-04-16	4	26	26	26	13	13	4
    Time taken: 5.511 seconds, Fetched: 7 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10

    解释:

    pv1: 分组内从起点到当前行的pv累积,如,11号的pv1 = 10号的pv+11号的pv, 12号 = 10号+11号+12号
    pv2: 同pv1
    pv3: 分组内(cookie1)所有的pv累加
    pv4: 分组内当前行+往前3行,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
    pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
    pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    扩展:

    如果不指定ROWS BETWEEN,默认为从起点到当前行;
    如果不指定ORDER BY,则将分组内所有值累加;
    关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
    PRECEDING:往前
    FOLLOWING:往后
    CURRENT ROW:当前行
    UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    1.2. avg,min,max,sum

    ​ AVG,MIN,MAX,和SUM用法一样。只需要把sum函数,改成avg,min,max,sum等就可以。

    语句:

    SELECT cookieid, createtime, pv,
    round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime), 2) AS pv1, --默认为从起点到当前行
    round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN
    UNBOUNDED PRECEDING AND CURRENT ROW), 2) AS pv2, --从起点到当前行,结果同pv1
    round(AVG(pv) OVER(PARTITION BY cookieid), 2) AS pv3, --分组内所有行
    round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3
    PRECEDING AND CURRENT ROW), 2) AS pv4, --当前行+往前3行
    round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3
    PRECEDING AND 1 FOLLOWING), 2) AS pv5, --当前行+往前3行+往后1行
    round(AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN
    CURRENT ROW AND UNBOUNDED FOLLOWING), 2) AS pv6 --当前行+往后所有行
    FROM cookie;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    结果如下:

    Total MapReduce CPU Time Spent: 0 msec
    OK
    cookie1	2021-04-16	4	3.71	3.71	3.71	3.25	3.25	4.0
    cookie1	2021-04-15	4	3.67	3.67	3.71	4.0	4.0	4.0
    cookie1	2021-04-14	2	3.6	3.6	3.71	4.25	4.2	3.33
    cookie1	2021-04-13	3	4.0	4.0	3.71	4.0	3.6	3.25
    cookie1	2021-04-12	7	4.33	4.33	3.71	4.33	4.0	4.0
    cookie1	2021-04-11	5	3.0	3.0	3.71	3.0	4.33	4.17
    cookie1	2021-04-10	1	1.0	1.0	3.71	1.0	3.0	3.71
    Time taken: 2.911 seconds, Fetched: 7 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10

    二、 ntile, row_number, rank, dense_rank

    准备数据:cookie2.txt

    cookie1,2021-04-10,1
    cookie1,2021-04-11,5
    cookie1,2021-04-12,7
    cookie1,2021-04-13,3
    cookie1,2021-04-14,2
    cookie1,2021-04-15,4
    cookie1,2021-04-16,4
    cookie2,2021-04-10,2
    cookie2,2021-04-11,3
    cookie2,2021-04-12,5
    cookie2,2021-04-13,6
    cookie2,2021-04-14,3
    cookie2,2021-04-15,9
    cookie2,2021-04-16,7
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14

    建表导入数据相关操作:

    create database if not exists myhive;
    use myhive;
    drop table if exists cookie2;
    create table cookie2(cookieid string, createtime string, pv int) row format
    delimited fields terminated by ',';
    load data local inpath "/home/data/cookie2.txt" into table cookie2;
    select * from cookie2;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    2.1. ntile

    NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
    NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
    如果切片不均匀,默认增加第一个切片的分布

    SQL语句实例:

    SELECT cookieid,createtime,pv,
    NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, --分组内将数据分成2片
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, --分组内将数据分成3片
    NTILE(4) OVER(ORDER BY createtime) AS rn3 --将所有数据分成4片
    FROM cookie2 ORDER BY cookieid,createtime;
    
    • 1
    • 2
    • 3
    • 4
    • 5

    结果如下:

    Total MapReduce CPU Time Spent: 0 msec
    OK
    cookie1	2021-04-10	1	1	1	1
    cookie1	2021-04-11	5	1	1	1
    cookie1	2021-04-12	7	1	1	2
    cookie1	2021-04-13	3	1	2	2
    cookie1	2021-04-14	2	2	2	3
    cookie1	2021-04-15	4	2	3	4
    cookie1	2021-04-16	4	2	3	4
    cookie2	2021-04-10	2	1	1	1
    cookie2	2021-04-11	3	1	1	1
    cookie2	2021-04-12	5	1	1	2
    cookie2	2021-04-13	6	1	2	2
    cookie2	2021-04-14	3	2	2	3
    cookie2	2021-04-15	9	2	3	3
    cookie2	2021-04-16	7	2	3	4
    Time taken: 4.193 seconds, Fetched: 14 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    比如,统计一个cookie,pv数最多的前1/3的天

    SELECT cookieid, createtime, pv,
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
    FROM cookie2;
    
    • 1
    • 2
    • 3

    结果:

    Stage-Stage-1:  HDFS Read: 2646 HDFS Write: 882 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    cookie1	2021-04-12	7	1
    cookie1	2021-04-11	5	1
    cookie1	2021-04-16	4	1
    cookie1	2021-04-15	4	2
    cookie1	2021-04-13	3	2
    cookie1	2021-04-14	2	3
    cookie1	2021-04-10	1	3
    cookie2	2021-04-15	9	1
    cookie2	2021-04-16	7	1
    cookie2	2021-04-13	6	1
    cookie2	2021-04-12	5	2
    cookie2	2021-04-11	3	2
    cookie2	2021-04-14	3	3
    cookie2	2021-04-10	2	3
    Time taken: 1.349 seconds, Fetched: 14 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    其中:rn = 1 的记录,就是我们想要的结果

    2.2. row_number

    ​ ROW_NUMBER() – 从1开始,按照顺序,生成分组内记录的序列 –比如,按照pv降序排列,生成分组内
    每天的pv名次,ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录;获取一个
    session中的第一条refer等。

    // 分组排序

    SELECT cookieid, createtime, pv,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
    FROM cookie2;
    
    • 1
    • 2
    • 3

    结果:

    cookie1	2021-04-12	7	1
    cookie1	2021-04-11	5	2
    cookie1	2021-04-16	4	3
    cookie1	2021-04-15	4	4
    cookie1	2021-04-13	3	5
    cookie1	2021-04-14	2	6
    cookie1	2021-04-10	1	7
    cookie2	2021-04-15	9	1
    cookie2	2021-04-16	7	2
    cookie2	2021-04-13	6	3
    cookie2	2021-04-12	5	4
    cookie2	2021-04-11	3	5
    cookie2	2021-04-14	3	6
    cookie2	2021-04-10	2	7
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14

    所以如果需要取每一组的前3名,只需要rn<=3即可

    2.3 rank 和 dense_rank

    RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位
    DENSE_RANK() 生成数据项在分组中的排名,排名相等会在名次中不会留下空位

    SQL语句实例:

    SELECT cookieid, createtime, pv,
    RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
    DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
    FROM cookie2
    WHERE cookieid = 'cookie1';
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    结果如下:

    cookie1	2021-04-12	7	1	1	1
    cookie1	2021-04-11	5	2	2	2
    cookie1	2021-04-16	4	3	3	3
    cookie1	2021-04-15	4	3	3	4
    cookie1	2021-04-13	3	5	4	5
    cookie1	2021-04-14	2	6	5	6
    cookie1	2021-04-10	1	7	6	7
    Time taken: 1.411 seconds, Fetched: 7 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    三者对比总结:

    row_number 按顺序编号,不留空位
    rank 按顺序编号,相同的值编相同号,留空位
    dense_rank 按顺序编号,相同的值编相同的号,不留空位
    
    • 1
    • 2
    • 3

    三、 cume_dist, percent_rank

    数据准备:cookie3.txt

    d1,user1,1000
    d1,user2,2000
    d1,user3,3000
    d2,user4,4000
    d2,user5,5000
    
    • 1
    • 2
    • 3
    • 4
    • 5

    建表导入数据相关操作:

    create database if not exists myhive;
    use myhive;
    drop table if exists cookie3;
    create table cookie3(dept string, userid string, sal int) row format delimited
    fields terminated by ',';
    load data local inpath "/home/data/cookie3.txt" into table cookie3;
    select * from cookie3;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    3.1. cume_dist

    –CUME_DIST 小于等于当前值的行数/分组内总行数
    比如,统计小于等于当前薪水的人数,所占总人数的比例

    语句:

    SELECT dept, userid, sal,
    round(CUME_DIST() OVER(ORDER BY sal), 2) AS rn1,
    round(CUME_DIST() OVER(PARTITION BY dept ORDER BY sal), 2) AS rn2
    FROM cookie3;
    
    • 1
    • 2
    • 3
    • 4

    结果如下:

    OK
    d1	user1	1000	0.2	0.33
    d1	user2	2000	0.4	0.67
    d1	user3	3000	0.6	1.0
    d2	user4	4000	0.8	0.5
    d2	user5	5000	1.0	1.0
    Time taken: 2.736 seconds, Fetched: 5 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    3.2. percent_rank

    PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
    SQL语句实例1:

    SELECT dept, userid, sal,
    PERCENT_RANK() OVER(ORDER BY sal) AS rn1, --分组内
    RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS rn12, --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
    FROM cookie3;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    结果:

    OK
    d1	user1	1000	0.0	1	5	0.0
    d1	user2	2000	0.25	2	5	0.5
    d1	user3	3000	0.5	3	5	1.0
    d2	user4	4000	0.75	4	5	0.0
    d2	user5	5000	1.0	5	5	1.0
    Time taken: 3.719 seconds, Fetched: 5 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    SQL语句实例2:

    SELECT dept, userid, sal,
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn1, --分组内
    RANK() OVER(ORDER BY sal) AS rn11, --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS rn12, --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
    FROM cookie3;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    结果:

    Total MapReduce CPU Time Spent: 0 msec
    OK
    d2	user5	5000	1.0	5	5	1.0
    d2	user4	4000	0.0	4	5	0.0
    d1	user3	3000	1.0	3	5	1.0
    d1	user2	2000	0.5	2	5	0.5
    d1	user1	1000	0.0	1	5	0.0
    Time taken: 3.814 seconds, Fetched: 5 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    四. lag, lead, frist_value, last_value

    数据准备:cookie4.txt

    cookie1,2021-04-10 10:00:02,url2
    cookie1,2021-04-10 10:00:00,url1
    cookie1,2021-04-10 10:03:04,1url3
    cookie1,2021-04-10 10:50:05,url6
    cookie1,2021-04-10 11:00:00,url7
    cookie1,2021-04-10 10:10:00,url4
    cookie1,2021-04-10 10:50:01,url5
    cookie2,2021-04-10 10:00:02,url22
    cookie2,2021-04-10 10:00:00,url11
    cookie2,2021-04-10 10:03:04,1url33
    cookie2,2021-04-10 10:50:05,url66
    cookie2,2021-04-10 11:00:00,url77
    cookie2,2021-04-10 10:10:00,url44
    cookie2,2021-04-10 10:50:01,url55
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14

    建表导入数据相关操作:

    create database if not exists myhive;
    use myhive;
    drop table if exists cookie4;
    create table cookie4(cookieid string, createtime string, url string) row format
    delimited fields terminated by ',';
    load data local inpath "/home/data/cookie4.txt" into table cookie4;
    select * from cookie4;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    4.1. lag

    LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
    第一个参数为列名,
    第二个参数为往上第n行(可选,默认为1),
    第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

    SQL语句实例:

    SELECT cookieid, createtime, url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY
    createtime) AS last_1_time,
    LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
    FROM cookie4;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    结果数据:

    Total MapReduce CPU Time Spent: 0 msec
    OK
    cookie1	2021-04-10 10:00:00	url1	1	1970-01-01 00:00:00	NULL
    cookie1	2021-04-10 10:00:02	url2	2	2021-04-10 10:00:00	NULL
    cookie1	2021-04-10 10:03:04	1url3	3	2021-04-10 10:00:02	2021-04-10 10:00:00
    cookie1	2021-04-10 10:10:00	url4	4	2021-04-10 10:03:04	2021-04-10 10:00:02
    cookie1	2021-04-10 10:50:01	url5	5	2021-04-10 10:10:00	2021-04-10 10:03:04
    cookie1	2021-04-10 10:50:05	url6	6	2021-04-10 10:50:01	2021-04-10 10:10:00
    cookie1	2021-04-10 11:00:00	url7	7	2021-04-10 10:50:05	2021-04-10 10:50:01
    cookie2	2021-04-10 10:00:00	url11	1	1970-01-01 00:00:00	NULL
    cookie2	2021-04-10 10:00:02	url22	2	2021-04-10 10:00:00	NULL
    cookie2	2021-04-10 10:03:04	1url33	3	2021-04-10 10:00:02	2021-04-10 10:00:00
    cookie2	2021-04-10 10:10:00	url44	4	2021-04-10 10:03:04	2021-04-10 10:00:02
    cookie2	2021-04-10 10:50:01	url55	5	2021-04-10 10:10:00	2021-04-10 10:03:04
    cookie2	2021-04-10 10:50:05	url66	6	2021-04-10 10:50:01	2021-04-10 10:10:00
    cookie2	2021-04-10 11:00:00	url77	7	2021-04-10 10:50:05	2021-04-10 10:50:01
    Time taken: 1.315 seconds, Fetched: 14 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    解释:

    last_1_time: 指定了往上第1行的值,default为'1970-01-01 00:00:00'
    cookie1第一行,往上1行为NULL,因此取默认值 1970-01-01 00:00:00
    cookie1第三行,往上1行值为第二行值, 2021-04-10 10:00:02
    cookie1第六行,往上1行值为第五行值, 2021-04-10 10:50:01
    last_2_time: 指定了往上第2行的值,为指定默认值
    cookie1第一行,往上2行为NULL
    cookie1第二行,往上2行为NULL
    cookie1第四行,往上2行为第二行值, 2021-04-10 10:00:02
    cookie1第七行,往上2行为第五行值, 2021-04-10 10:50:01
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    4.2. lead

    与LAG相反
    LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
    第一个参数为列名,
    第二个参数为往下第n行(可选,默认为1),
    第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

    SQL语句实例:

    SELECT cookieid, createtime, url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY
    createtime) AS next_1_time,
    LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS
    next_2_time
    FROM cookie4;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    结果:

    Total MapReduce CPU Time Spent: 0 msec
    OK
    cookie1	2021-04-10 10:00:00	url1	1	2021-04-10 10:00:02	2021-04-10 10:03:04
    cookie1	2021-04-10 10:00:02	url2	2	2021-04-10 10:03:04	2021-04-10 10:10:00
    cookie1	2021-04-10 10:03:04	1url3	3	2021-04-10 10:10:00	2021-04-10 10:50:01
    cookie1	2021-04-10 10:10:00	url4	4	2021-04-10 10:50:01	2021-04-10 10:50:05
    cookie1	2021-04-10 10:50:01	url5	5	2021-04-10 10:50:05	2021-04-10 11:00:00
    cookie1	2021-04-10 10:50:05	url6	6	2021-04-10 11:00:00	NULL
    cookie1	2021-04-10 11:00:00	url7	7	1970-01-01 00:00:00	NULL
    cookie2	2021-04-10 10:00:00	url11	1	2021-04-10 10:00:02	2021-04-10 10:03:04
    cookie2	2021-04-10 10:00:02	url22	2	2021-04-10 10:03:04	2021-04-10 10:10:00
    cookie2	2021-04-10 10:03:04	1url33	3	2021-04-10 10:10:00	2021-04-10 10:50:01
    cookie2	2021-04-10 10:10:00	url44	4	2021-04-10 10:50:01	2021-04-10 10:50:05
    cookie2	2021-04-10 10:50:01	url55	5	2021-04-10 10:50:05	2021-04-10 11:00:00
    cookie2	2021-04-10 10:50:05	url66	6	2021-04-10 11:00:00	NULL
    cookie2	2021-04-10 11:00:00	url77	7	1970-01-01 00:00:00	NULL
    Time taken: 1.304 seconds, Fetched: 14 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    4.3. first_value

    取分组内排序后,截止到当前行,第一个值

    SQL语句实例:

    SELECT cookieid, createtime, url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
    FROM cookie4;
    
    • 1
    • 2
    • 3
    • 4

    结果:

    OK
    cookie1	2021-04-10 10:00:00	url1	1	url1
    cookie1	2021-04-10 10:00:02	url2	2	url1
    cookie1	2021-04-10 10:03:04	1url3	3	url1
    cookie1	2021-04-10 10:10:00	url4	4	url1
    cookie1	2021-04-10 10:50:01	url5	5	url1
    cookie1	2021-04-10 10:50:05	url6	6	url1
    cookie1	2021-04-10 11:00:00	url7	7	url1
    cookie2	2021-04-10 10:00:00	url11	1	url11
    cookie2	2021-04-10 10:00:02	url22	2	url11
    cookie2	2021-04-10 10:03:04	1url33	3	url11
    cookie2	2021-04-10 10:10:00	url44	4	url11
    cookie2	2021-04-10 10:50:01	url55	5	url11
    cookie2	2021-04-10 10:50:05	url66	6	url11
    cookie2	2021-04-10 11:00:00	url77	7	url11
    Time taken: 1.276 seconds, Fetched: 14 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16

    4.4. last_value

    取分组内排序后,截止到当前行,最后一个值

    SQL语句实例:

    SELECT cookieid, createtime, url,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
    LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
    FROM cookie4;
    
    • 1
    • 2
    • 3
    • 4

    结果:

    OK
    cookie1	2021-04-10 10:00:00	url1	1	url1
    cookie1	2021-04-10 10:00:02	url2	2	url2
    cookie1	2021-04-10 10:03:04	1url3	3	1url3
    cookie1	2021-04-10 10:10:00	url4	4	url4
    cookie1	2021-04-10 10:50:01	url5	5	url5
    cookie1	2021-04-10 10:50:05	url6	6	url6
    cookie1	2021-04-10 11:00:00	url7	7	url7
    cookie2	2021-04-10 10:00:00	url11	1	url11
    cookie2	2021-04-10 10:00:02	url22	2	url22
    cookie2	2021-04-10 10:03:04	1url33	3	1url33
    cookie2	2021-04-10 10:10:00	url44	4	url44
    cookie2	2021-04-10 10:50:01	url55	5	url55
    cookie2	2021-04-10 10:50:05	url66	6	url66
    cookie2	2021-04-10 11:00:00	url77	7	url77
    Time taken: 1.273 seconds, Fetched: 14 row(s)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16


    声明:
            文章中代码及相关语句为自己根据相应理解编写,文章中出现的相关图片为自己实践中的截图和相关技术对应的图片,若有相关异议,请联系删除。感谢。转载请注明出处,感谢。


    By luoyepiaoxue2014

    B站: https://space.bilibili.com/1523287361 点击打开链接
    微博地址: http://weibo.com/luoyepiaoxue2014 点击打开链接

  • 相关阅读:
    LeetCode 0710.黑名单中的随机数 - 预处理实现O(1)取值
    python代码书写规范和严格缩进问题,nginx 502 Bad Gateway
    C++之std::atomic解决多线程7个问题(二百四)
    Cesium屏幕中心坐标和相机位置坐标计算
    java中的重要IO流
    Python-Pandas库中的交叉表
    力扣1901.寻找峰值II
    搜索与图论:Prim
    java&springboot&mysql学生社团管理系统26281-计算机毕业设计项目选题推荐(附源码)
    XML解析
  • 原文地址:https://blog.csdn.net/luoyepiaoxue2014/article/details/128059229