Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> how to obtain the latest record for each user in a hive table?


Copy link to this message
-
Re: how to obtain the latest record for each user in a hive table?
On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher
<[EMAIL PROTECTED]> wrote:
> The table format is something like:
>
> user_id    visiting_time      visiting_web_page
> user1          time11             page_string_11
> user1          time12             page_string_12
> user1          time13             page_string_13          <-- latest time
> stamp
> user1          time14             page_string_14          <-- also latest
> time stamp, same as the one above
> user1          time15             page_string_15
>   ...                ...                       .....
> user2          time21             page_string_21
> user2          time22             page_string_22
> user2          time23             page_string_23
> user2          time24             page_string_24
> user2          time25             page_string_25
>    ....              .....                       ....
>
> Assume time13 and time14 are the latest time stamp for user1, time22 and
> time25 are the latest time stamp for user2,
> how to obtain output like:
> user1        [page_string_13, page_string_14]
> user2        [page_string_22, page_string_25]
>
> ps: run {   select user, max(visiting_time) group by user  }    WILL return
> result like:
> user1, time13   (same as time 14)
> user2, time22   (same as time 25)
>
> many thanks in advance!
>

Part of the solution is to use hive collect_set UDF. This collects
group into a hive array type.

select userid,visitingtime, collect_Set(visiting_web_page) from table
group by userid,visitingtime;

Collect_set de-duplicates. If you do not want this use the collect_udf I wrote.

https://github.com/edwardcapriolo/hive-collect

If you want only the newest one you can use rank.

https://github.com/edwardcapriolo/hive-rank
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB