找了两道Hive有关的SQL练习题目,第一道是统计用户浏览量的数据。第二道是统计京东商铺浏览量的数据。

(一)统计用户浏览量

我们有如下的用户访问数据

userIdvisitDatevisitCount
u012017/1/215
u022017/1/236
u032017/1/228
u042017/1/203
u012017/1/236
u012017/2/218
u022017/1/236
u012017/2/224

要求使用SQL统计出每个用户的累积访问次数,如下表所示:

用户id月份小计累积
u012017-011111
u012017-021223
u022017-011212
u032017-0188
u042017-0133

原始数据如下:

1
2
3
4
5
6
7
8
u01     2017/1/21       5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
u02 2017/1/23 6
u01 2017/2/22 4

建表语句

1
2
3
4
5
create table action
(userId string,
visitDate string,
visitCount int)
row format delimited fields terminated by "\t";

答:

一、首先转换月份格式,统计每个月、每个用户的访问量

1
2
3
4
5
6
7
8
SELECT 
userid id ,
date_format(REGEXP_REPLACE(visitdate,'/','-'),'yyyy-MM') mon ,
sum(visitcount) views
FROM
action
GROUP BY
userid , visitdate;

执行结果:

1
2
3
4
5
6
7
8
id	mon	views
u01 2017-01 5
u01 2017-01 6
u01 2017-02 8
u01 2017-02 4
u02 2017-01 12
u03 2017-01 8
u04 2017-01 3

二、再按照月份和用户进行分组并统计访问量。

1
2
3
4
5
6
7
8
9
10
11
SELECT 
id ,
mon ,
sum(visitcount) views
FROM(SELECT
userid id ,
date_format(REGEXP_REPLACE(visitdate,'/','-'),'yyyy-MM') mon ,
visitcount
FROM
action) t1
GROUP BY id ,mon;

执行结果

1
2
3
4
5
6
id	mon	views
u01 2017-01 11
u01 2017-02 12
u02 2017-01 12
u03 2017-01 8
u04 2017-01 3

三、 对每个用户的数据进行累加,再次子查询

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
select
userid,
mon,
views,
(sum(views) over(partition by userid order by mon)) total
FROM
(SELECT
userid ,
mon ,
sum(visitcount) views
FROM(SELECT
userid ,
date_format(REGEXP_REPLACE(visitdate,'/','-'),'yyyy-MM') mon ,
visitcount
FROM
action) t1
GROUP BY
userid ,mon)t2;

执行结果

1
2
3
4
5
6
userid	mon	views	total
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3


(二)统计网店访问量

有50W个京东店铺,每个顾客访问任何一个店铺的任何一个商品时都会产生一条访问日志,访问日志存储的表名为visit,访客的用户id为user_id,被访问的店铺名称为shop,请统计:
原始数据如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
u1	a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a

建表:

1
2
create table visit(user_id string,shop string) 
row format delimited fields terminated by '\t';

问:
1)每个店铺的UV(访客数)
2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数

答:

1)

一、使用GROUP BY对用户进行去重

1
2
3
4
5
6
7
8
SELECT
shop,
user_id
FROM
visit
GROUP BY
shop,
user_id

执行结果:

1
2
3
4
5
6
7
8
9
10
11
12
shop	user_id
a u1
a u2
a u3
a u5
b u1
b u2
b u4
b u5
c u2
c u3
c u6

二、对去重的结果进行用户数目统计

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT 
shop,
COUNT(*) uv
FROM
(SELECT
shop,
user_id
FROM
visit
GROUP BY
shop,
user_id) t1
GROUP BY
shop;

执行结果

1
2
3
4
shop	uv
a 4
b 4
c 3

2)

一、计算每个人访问每个店铺的总次数

1
2
3
4
5
6
7
8
9
SELECT 
shop,
user_id,
count(*)
FROM
visit
GROUP BY
shop,
user_id;

二、针对同一店铺,对方问次数进行逆序排序,并计算rank值

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
SELECT
shop,
user_id,
ct,
row_number() over(partition by shop order by ct desc ) rk
FROM
(SELECT
shop,
user_id,
count(*) ct
FROM
visit
GROUP BY
shop,
user_id) t1;

三、去店铺访问前三名的用户

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
SELECT
shop,
user_id,
ct
FROM
(SELECT
shop,
user_id,
ct,
row_number() over(partition by shop order by ct desc ) rk
FROM
(SELECT
shop,
user_id,
count(*) ct
FROM
visit
GROUP BY
shop,
user_id) t1)t2
WHERE
rk <= 3;

3

Exer4:

1
2
3
4
CREATE TABLE user_age
(dt STRING, //日期
user_id STRING, //用户
age INT) //年龄

1、所有用户的总数及平均年龄

1
2
3
4
5
6
7
SELECT 
user_id,
AVG(age) age
FROM
user_age
GROUP BY
user_id;t1

1
2
3
4
5
6
7
8
9
10
11
12

SELECT
COUNT(*) count1,
AVG(aage) avg_total_age
FROM
(SELECT
user_id,
AVG(age) aage
FROM
user_age
GROUP BY
user_id)t1;t0
1
2
3
count1  avg_total_age
3 27.0
~~~~~~~~~~~~~~~

2、活跃用户的总数及其平均年龄(2种方法)
① 用lead或者tag,在用Datediff找相邻一天的,然后去重就得出结果,不具备扩展性,实现多天活跃比较复杂。
② 等差数列
2.1 给每个用户的日期,同一天去重rank

1
2
3
4
5
6
7
8
SELECT
user_id,
dt,
ROW_Number() OVER(PARTITION BY user_id ORDER BY dt) rk
FROM
user_age
GROUP BY
user_id,dt;t2

2.2 用等差数列

1
2
3
4
5
SELECT
user_id,
DATE_SUB(dt,rk) diff
FROM
t2;t3

2.3 求出差距

1
2
3
4
5
SELECT
user_id,
COUNT(diff) OVER(PARTITION BY user_id,diff) count_diff
FROM
t3;t4

2.4 最大连续天数

1
2
3
4
5
6
7
8
SELECT
user_id
FROM
t4
WHERE
count_diff >= 2
GROUP BY
user_id;t5

2.5 去重并且得到age

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT
user_age.user_id,
user_age.age
FROM
user_age,t5
WHERE
user_age.user_id = t5.user_id;t6

SELECT
COUNT(*) active_count,
AVG(age) active_avg_age
FROM
t6; t7

3、合并

最终SQL:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
SELECT
count_total,
avg_total_age,
active_count,
active_avg_age
FROM
(SELECT
COUNT(*) count_total,
AVG(aage) avg_total_age
FROM
(SELECT
user_id,
AVG(age) aage
FROM
user_age
GROUP BY
user_id)t1)t0
,
(SELECT
COUNT(*) active_count,
AVG(age) active_avg_age
FROM
(SELECT
DISTINCT user_age.user_id,
user_age.age
FROM
user_age,
(SELECT
user_id
FROM
(SELECT
user_id,
COUNT(diff) OVER(PARTITION BY user_id,diff) count_diff
FROM
(SELECT
user_id,
DATE_SUB(dt,rk) diff
FROM
(SELECT
user_id,
dt,
ROW_Number() OVER(PARTITION BY user_id ORDER BY dt) rk
FROM
user_age
GROUP BY
user_id,dt)t2)t3)t4
WHERE
count_diff >= 2
GROUP BY
user_id)t5
WHERE
user_age.user_id = t5.user_id)t6)t7;

1
2
3
count_total avg_total_age   active_count    active_avg_age
3 27.0 1 19.0
~~~~~~~~~~~~~~~