Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - how to obtain the latest record for each user in a hive table?


Copy link to this message
-
Re: how to obtain the latest record for each user in a hive table?
qiaoresearcher 2012-11-19, 23:30
Hi Edward,

thanks for you reply.

right now my solution is:
1) 1st step, run:   select concat(user_id,max(visiting_time)) from table ,
then we have something like:
user1-time13
user2-time22
 ....      ....
Then the results are put into some file, say, file_one.txt
2) 2nd step, run:  select user_id, case concat(user_id,visiting_time) is
found in file_one.txt then visitiing_web_page else null end from table
after this, we can run collect_set function.

These two steps is able to generate the results I want, however, when the
user id numbers is huge, for example, several billions of user ids,
file_one.txt becomes huge as well, step 2 has problem to finish, hive will
report problems and never finish....

is there any more efficient ways? it seems in SQL, there is function called
 row_number, does Hive have similar one? I assume the rank function you
provided below is something similar to row_number, right? i am not hadoop
administrator, can I run the rank function in hive?

thanks again!
On Mon, Nov 19, 2012 at 4:55 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:

> On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher
> <[EMAIL PROTECTED]> wrote:
> > The table format is something like:
> >
> > user_id    visiting_time      visiting_web_page
> > user1          time11             page_string_11
> > user1          time12             page_string_12
> > user1          time13             page_string_13          <-- latest time
> > stamp
> > user1          time14             page_string_14          <-- also latest
> > time stamp, same as the one above
> > user1          time15             page_string_15
> >   ...                ...                       .....
> > user2          time21             page_string_21
> > user2          time22             page_string_22
> > user2          time23             page_string_23
> > user2          time24             page_string_24
> > user2          time25             page_string_25
> >    ....              .....                       ....
> >
> > Assume time13 and time14 are the latest time stamp for user1, time22 and
> > time25 are the latest time stamp for user2,
> > how to obtain output like:
> > user1        [page_string_13, page_string_14]
> > user2        [page_string_22, page_string_25]
> >
> > ps: run {   select user, max(visiting_time) group by user  }    WILL
> return
> > result like:
> > user1, time13   (same as time 14)
> > user2, time22   (same as time 25)
> >
> > many thanks in advance!
> >
>
> Part of the solution is to use hive collect_set UDF. This collects
> group into a hive array type.
>
> select userid,visitingtime, collect_Set(visiting_web_page) from table
> group by userid,visitingtime;
>
> Collect_set de-duplicates. If you do not want this use the collect_udf I
> wrote.
>
> https://github.com/edwardcapriolo/hive-collect
>
> If you want only the newest one you can use rank.
>
> https://github.com/edwardcapriolo/hive-rank
>