Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - need help on writing hive query


Copy link to this message
-
RE: need help on writing hive query
java8964 java8964 2012-10-31, 23:08

If you don't need to join current_web_page and previous_web_page, assuming you can just trust the time stamp, as Phil points out, an custom UDF of collect_list() is the way to go.
You need to implement collect_list() UDF by yourself, hive doesn't have one by default.But it should be straight forward. In fact, you can reuse the code of collect_set(), replace the internal set with a Java ArrayList, then
select user_id, collect_list(user_current_web_page)from(    select user_id, user_current_web_page    order by user_id asc, user_visiting_time asc)agroup by user_id
Yong
> Subject: Re: need help on writing hive query
> From: [EMAIL PROTECTED]
> Date: Wed, 31 Oct 2012 17:53:06 -0400
> To: [EMAIL PROTECTED]
>
> I did a similar query a few months ago.  In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array().  There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order.
>
> If there's an easier way, please let me know.
>
> Matt Tucker
>
> On Oct 31, 2012, at 5:37 PM, Tom Brown <[EMAIL PROTECTED]> wrote:
>
> > It wouldn't retrieve the user's path in a single string, but you could
> > simply select the user id and current page, ordered by the timestamp.
> >
> > It would require a second step to turn it into the single string path,
> > so that might be a deal-breaker.
> >
> > --Tom
> >
> > On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
> > <[EMAIL PROTECTED]> wrote:
> >> You could use collect_set() and GROUP BY. That wouldn't preserve order
> >> though.
> >>
> >> Phil.
> >>
> >> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> here is the question. Assume we have a table like:
> >>>
> >>> ------------------------------------------------------------------------------------------------------------------------------
> >>> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
> >>> user_previous_web_page
> >>> user 1                 time (1,1)                                   page 1
> >>> page 0
> >>> user 1                 time (1,2)                                   page 2
> >>> page 1
> >>> user 1                 time (1,3 )                                  page 3
> >>> page 2
> >>> .....                          ......
> >>> ....                                                ....
> >>> user n                 time (n,1)                                   page 1
> >>> page 0
> >>> user n                 time (n,2)                                   page 2
> >>> page 1
> >>> user n                 time (n,3)                                   page 3
> >>> page 2
> >>>
> >>> that is, in each row, we know the current web page that user is viewing,
> >>> and we know the previous web page the user coming from
> >>>
> >>> now we want to generate a list for each user that recorded the complete
> >>> path the user is taking:
> >>> i.e., how can we use hive to generate output like:
> >>>
> >>> ------------------------------------------------------------------------------------------------
> >>> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
> >>> beginning page of user 1)
> >>> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
> >>> the beginning page of user 2)
> >>> the web pages viewed by user 1 and user 2 might be different.
> >>>
> >>> can we generate this using hive?
> >>>
> >>> thanks,