Hive中collect_list()排序问题详解

来看一道互联网公司的面试题：

1
2
3

有个用户好友表：字段如下
uid  fans_uid   score
返回：uid, fans_uid_list【fans_uid的拼接串，按照score降序拼接】

给出数据源：每个uid，有很多对应的fans_uid，每个fans_uid 都对应一个score，我们需要按uid分组，将fans_uid 的score按降序排序，将fans_uid 放在一个列表中，做好友推荐

create temporary table tb_user_fans as 
select 1 as uid,'a' as fans_uid,3 as score
union all
select 1 as uid,'b' as fans_uid,1 as score
union all
select 1 as uid,'c' as fans_uid,4 as score
union all
select 1 as uid,'d' as fans_uid,3 as score
union all
select 1 as uid,'e' as fans_uid,2 as score
union all
select 2 as uid,'a' as fans_uid,4 as score
union all
select 2 as uid,'b' as fans_uid,3 as score
union all
select 2 as uid,'c' as fans_uid,1 as score
union all
select 2 as uid,'d' as fans_uid,2 as score
union all
select 2 as uid,'e' as fans_uid,5 as score
union all
select 3 as uid,'a' as fans_uid,6 as score
union all
select 3 as uid,'b' as fans_uid,3 as score
union all
select 3 as uid,'c' as fans_uid,5 as score
;

我想要的结果：按照uid分组，按照score降序排序，取出fans_uid放到列表里

1
2
3

1	[c,d,a,e,b]
2	[e,a,b,d,c]
3	[a,c,b]

解法一：

select
    uid,
    collect_list(fans_uid) as fans_uid_list
from (
select 
    uid
    ,fans_uid
    ,score
    ,row_cnt
from (
    select
        uid
        ,fans_uid
        ,score
        ,row_number() over (partition by uid order by score desc ) as row_cnt
    from tb_user_fans
    ) 
order by row_cnt asc
)
group by uid
order by uid

如果数据量大不推荐使用，因为使用到了全局排序order by，只有一个reducer，那么数据量大计算要很长时间。

解法二：

select 
    uid
    ,regexp_replace(
        concat_ws(
            ','
            ,sort_array(
                collect_list(
                    conact_ws(':' ,lpad(cast(rank_num as string),5,'0') ,fans_uid)
                )
            )
        )
        ,'\\d+\:'
        ,''
    ) as fans_uid_list
from (
    select
        uid
        ,fans_uid
        ,score
        ,row_number() over (partition by uid order by score desc ) as row_cnt
    from tb_user_fans
)
group by uid

这里将row_cnt放在了fans_uid之前，用冒号分隔，然后用sort_array函数对collect_list之后的结果进行排序（只支持升序）。特别注意，rank必须要在高位补足够的0对齐，因为排序的是字符串而不是数字，如果不补0的话，按字典序排序就会变成1, 10, 11, 12, 13, 2, 3, 4…，又不对了。
将排序的结果拼起来之后，用regexp_replace函数替换掉冒号及其前面的数字，大功告成。

解法三：最优解，写法简洁

select
    uid,
    collect_list(fans_uid) as fans_uid_list
from (
	select 
	    uid
	    ,fans_uid
	    ,score
	from tb_user_fans distribute by uid sort by uid,score desc 
)
group by uid