|
|
-
Re: how to obtain the latest record for each user in a hive table?qiaoresearcher 2012-11-19, 23:30
Hi Edward,
thanks for you reply. right now my solution is: 1) 1st step, run: select concat(user_id,max(visiting_time)) from table , then we have something like: user1-time13 user2-time22 .... .... Then the results are put into some file, say, file_one.txt 2) 2nd step, run: select user_id, case concat(user_id,visiting_time) is found in file_one.txt then visitiing_web_page else null end from table after this, we can run collect_set function. These two steps is able to generate the results I want, however, when the user id numbers is huge, for example, several billions of user ids, file_one.txt becomes huge as well, step 2 has problem to finish, hive will report problems and never finish.... is there any more efficient ways? it seems in SQL, there is function called row_number, does Hive have similar one? I assume the rank function you provided below is something similar to row_number, right? i am not hadoop administrator, can I run the rank function in hive? thanks again! On Mon, Nov 19, 2012 at 4:55 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher > <[EMAIL PROTECTED]> wrote: > > The table format is something like: > > > > user_id visiting_time visiting_web_page > > user1 time11 page_string_11 > > user1 time12 page_string_12 > > user1 time13 page_string_13 <-- latest time > > stamp > > user1 time14 page_string_14 <-- also latest > > time stamp, same as the one above > > user1 time15 page_string_15 > > ... ... ..... > > user2 time21 page_string_21 > > user2 time22 page_string_22 > > user2 time23 page_string_23 > > user2 time24 page_string_24 > > user2 time25 page_string_25 > > .... ..... .... > > > > Assume time13 and time14 are the latest time stamp for user1, time22 and > > time25 are the latest time stamp for user2, > > how to obtain output like: > > user1 [page_string_13, page_string_14] > > user2 [page_string_22, page_string_25] > > > > ps: run { select user, max(visiting_time) group by user } WILL > return > > result like: > > user1, time13 (same as time 14) > > user2, time22 (same as time 25) > > > > many thanks in advance! > > > > Part of the solution is to use hive collect_set UDF. This collects > group into a hive array type. > > select userid,visitingtime, collect_Set(visiting_web_page) from table > group by userid,visitingtime; > > Collect_set de-duplicates. If you do not want this use the collect_udf I > wrote. > > https://github.com/edwardcapriolo/hive-collect > > If you want only the newest one you can use rank. > > https://github.com/edwardcapriolo/hive-rank > |