来看一道互联网公司的面试题:

1
2
3
有个用户好友表:字段如下
uid fans_uid score
返回:uid, fans_uid_list【fans_uid的拼接串,按照score降序拼接】

给出数据源:每个uid,有很多对应的fans_uid,每个fans_uid 都对应一个score,我们需要按uid分组,将fans_uid 的score按降序排序,将fans_uid 放在一个列表中,做好友推荐

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
create temporary table tb_user_fans as 
select 1 as uid,'a' as fans_uid,3 as score
union all
select 1 as uid,'b' as fans_uid,1 as score
union all
select 1 as uid,'c' as fans_uid,4 as score
union all
select 1 as uid,'d' as fans_uid,3 as score
union all
select 1 as uid,'e' as fans_uid,2 as score
union all
select 2 as uid,'a' as fans_uid,4 as score
union all
select 2 as uid,'b' as fans_uid,3 as score
union all
select 2 as uid,'c' as fans_uid,1 as score
union all
select 2 as uid,'d' as fans_uid,2 as score
union all
select 2 as uid,'e' as fans_uid,5 as score
union all
select 3 as uid,'a' as fans_uid,6 as score
union all
select 3 as uid,'b' as fans_uid,3 as score
union all
select 3 as uid,'c' as fans_uid,5 as score
;

我想要的结果:按照uid分组,按照score降序排序,取出fans_uid放到列表里

1
2
3
1	[c,d,a,e,b]
2 [e,a,b,d,c]
3 [a,c,b]

解法一:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
select
uid,
collect_list(fans_uid) as fans_uid_list
from (
select
uid
,fans_uid
,score
,row_cnt
from (
select
uid
,fans_uid
,score
,row_number() over (partition by uid order by score desc ) as row_cnt
from tb_user_fans
)
order by row_cnt asc
)
group by uid
order by uid

如果数据量大不推荐使用,因为使用到了全局排序order by,只有一个reducer,那么数据量大计算要很长时间。

解法二:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
select 
uid
,regexp_replace(
concat_ws(
','
,sort_array(
collect_list(
conact_ws(':' ,lpad(cast(rank_num as string),5,'0') ,fans_uid)
)
)
)
,'\\d+\:'
,''
) as fans_uid_list
from (
select
uid
,fans_uid
,score
,row_number() over (partition by uid order by score desc ) as row_cnt
from tb_user_fans
)
group by uid

这里将row_cnt放在了fans_uid之前,用冒号分隔,然后用sort_array函数对collect_list之后的结果进行排序(只支持升序)。特别注意,rank必须要在高位补足够的0对齐,因为排序的是字符串而不是数字,如果不补0的话,按字典序排序就会变成1, 10, 11, 12, 13, 2, 3, 4…,又不对了。
将排序的结果拼起来之后,用regexp_replace函数替换掉冒号及其前面的数字,大功告成。

解法三:最优解,写法简洁

1
2
3
4
5
6
7
8
9
10
11
select
uid,
collect_list(fans_uid) as fans_uid_list
from (
select
uid
,fans_uid
,score
from tb_user_fans distribute by uid sort by uid,score desc
)
group by uid

诀窍是使用带有DISTRIBUTE BY和SORT BY语句的子查询

实际场景中:这个最优解到了collect_list还是存在多个mapper顺序问题,collect_list()的底层是使⽤ArrayList来实现的,当put到这个ArrayList的时候,不⼀定哪个Mapper先,哪个Mapper后。所以还是需要使用第二种解法。