Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> how to obtain the latest record for each user in a hive table?


+
qiaoresearcher 2012-11-19, 21:02
+
Edward Capriolo 2012-11-19, 22:55
Copy link to this message
-
Re: how to obtain the latest record for each user in a hive table?
Hi Edward,

thanks for you reply.

right now my solution is:
1) 1st step, run:   select concat(user_id,max(visiting_time)) from table ,
then we have something like:
user1-time13
user2-time22
 ....      ....
Then the results are put into some file, say, file_one.txt
2) 2nd step, run:  select user_id, case concat(user_id,visiting_time) is
found in file_one.txt then visitiing_web_page else null end from table
after this, we can run collect_set function.

These two steps is able to generate the results I want, however, when the
user id numbers is huge, for example, several billions of user ids,
file_one.txt becomes huge as well, step 2 has problem to finish, hive will
report problems and never finish....

is there any more efficient ways? it seems in SQL, there is function called
 row_number, does Hive have similar one? I assume the rank function you
provided below is something similar to row_number, right? i am not hadoop
administrator, can I run the rank function in hive?

thanks again!
On Mon, Nov 19, 2012 at 4:55 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:

> On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher
> <[EMAIL PROTECTED]> wrote:
> > The table format is something like:
> >
> > user_id    visiting_time      visiting_web_page
> > user1          time11             page_string_11
> > user1          time12             page_string_12
> > user1          time13             page_string_13          <-- latest time
> > stamp
> > user1          time14             page_string_14          <-- also latest
> > time stamp, same as the one above
> > user1          time15             page_string_15
> >   ...                ...                       .....
> > user2          time21             page_string_21
> > user2          time22             page_string_22
> > user2          time23             page_string_23
> > user2          time24             page_string_24
> > user2          time25             page_string_25
> >    ....              .....                       ....
> >
> > Assume time13 and time14 are the latest time stamp for user1, time22 and
> > time25 are the latest time stamp for user2,
> > how to obtain output like:
> > user1        [page_string_13, page_string_14]
> > user2        [page_string_22, page_string_25]
> >
> > ps: run {   select user, max(visiting_time) group by user  }    WILL
> return
> > result like:
> > user1, time13   (same as time 14)
> > user2, time22   (same as time 25)
> >
> > many thanks in advance!
> >
>
> Part of the solution is to use hive collect_set UDF. This collects
> group into a hive array type.
>
> select userid,visitingtime, collect_Set(visiting_web_page) from table
> group by userid,visitingtime;
>
> Collect_set de-duplicates. If you do not want this use the collect_udf I
> wrote.
>
> https://github.com/edwardcapriolo/hive-collect
>
> If you want only the newest one you can use rank.
>
> https://github.com/edwardcapriolo/hive-rank
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB