|
|
-
need help on writing hive query
qiaoresearcher 2012-10-31, 21:18
Hi all,
here is the question. Assume we have a table like: ------------------------------------------------------------------------------------------------------------------------------ user_id || user_visiting_time || user_current_web_page || user_previous_web_page user 1 time (1,1) page 1 page 0 user 1 time (1,2) page 2 page 1 user 1 time (1,3 ) page 3 page 2 ..... ...... .... .... user n time (n,1) page 1 page 0 user n time (n,2) page 2 page 1 user n time (n,3) page 3 page 2
that is, in each row, we know the current web page that user is viewing, and we know the previous web page the user coming from
now we want to generate a list for each user that recorded the complete path the user is taking: i.e., how can we use hive to generate output like: ------------------------------------------------------------------------------------------------ user 1 : page 1 page 2 page 3 page 4 .......... (till reach the beginning page of user 1) user 2: page 1 page 2 page 3 page 4 page 5 ....... ( till reach the beginning page of user 2) the web pages viewed by user 1 and user 2 might be different.
can we generate this using hive?
thanks,
-
Re: need help on writing hive query
Mark Grover 2012-10-31, 21:29
You should look into Hive's cluster by/distribute by functionality. https://cwiki.apache.org/Hive/languagemanual-sortby.html#LanguageManualSortBy-SyntaxofClusterByandDistributeByhttps://cwiki.apache.org/Hive/languagemanual-transform.htmlOn Wed, Oct 31, 2012 at 2:18 PM, qiaoresearcher <[EMAIL PROTECTED]>wrote: > Hi all, > > here is the question. Assume we have a table like: > > ------------------------------------------------------------------------------------------------------------------------------ > user_id || user_visiting_time || user_current_web_page || > user_previous_web_page > user 1 time (1,1) page 1 > page 0 > user 1 time (1,2) page 2 > page 1 > user 1 time (1,3 ) page 3 > page 2 > ..... ...... > .... .... > user n time (n,1) page 1 > page 0 > user n time (n,2) page 2 > page 1 > user n time (n,3) page 3 > page 2 > > that is, in each row, we know the current web page that user is viewing, > and we know the previous web page the user coming from > > now we want to generate a list for each user that recorded the complete > path the user is taking: > i.e., how can we use hive to generate output like: > > ------------------------------------------------------------------------------------------------ > user 1 : page 1 page 2 page 3 page 4 .......... (till reach the > beginning page of user 1) > user 2: page 1 page 2 page 3 page 4 page 5 ....... ( till reach > the beginning page of user 2) > the web pages viewed by user 1 and user 2 might be different. > > can we generate this using hive? > > thanks, >
-
Re: need help on writing hive query
Philip Tromans 2012-10-31, 21:32
You could use collect_set() and GROUP BY. That wouldn't preserve order though.
Phil. On Oct 31, 2012 9:18 PM, "qiaoresearcher" <[EMAIL PROTECTED]> wrote:
> Hi all, > > here is the question. Assume we have a table like: > > ------------------------------------------------------------------------------------------------------------------------------ > user_id || user_visiting_time || user_current_web_page || > user_previous_web_page > user 1 time (1,1) page 1 > page 0 > user 1 time (1,2) page 2 > page 1 > user 1 time (1,3 ) page 3 > page 2 > ..... ...... > .... .... > user n time (n,1) page 1 > page 0 > user n time (n,2) page 2 > page 1 > user n time (n,3) page 3 > page 2 > > that is, in each row, we know the current web page that user is viewing, > and we know the previous web page the user coming from > > now we want to generate a list for each user that recorded the complete > path the user is taking: > i.e., how can we use hive to generate output like: > > ------------------------------------------------------------------------------------------------ > user 1 : page 1 page 2 page 3 page 4 .......... (till reach the > beginning page of user 1) > user 2: page 1 page 2 page 3 page 4 page 5 ....... ( till reach > the beginning page of user 2) > the web pages viewed by user 1 and user 2 might be different. > > can we generate this using hive? > > thanks, >
-
Re: need help on writing hive query
Tom Brown 2012-10-31, 21:37
It wouldn't retrieve the user's path in a single string, but you could simply select the user id and current page, ordered by the timestamp.
It would require a second step to turn it into the single string path, so that might be a deal-breaker.
--Tom
On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans <[EMAIL PROTECTED]> wrote: > You could use collect_set() and GROUP BY. That wouldn't preserve order > though. > > Phil. > > On Oct 31, 2012 9:18 PM, "qiaoresearcher" <[EMAIL PROTECTED]> wrote: >> >> Hi all, >> >> here is the question. Assume we have a table like: >> >> ------------------------------------------------------------------------------------------------------------------------------ >> user_id || user_visiting_time || user_current_web_page || >> user_previous_web_page >> user 1 time (1,1) page 1 >> page 0 >> user 1 time (1,2) page 2 >> page 1 >> user 1 time (1,3 ) page 3 >> page 2 >> ..... ...... >> .... .... >> user n time (n,1) page 1 >> page 0 >> user n time (n,2) page 2 >> page 1 >> user n time (n,3) page 3 >> page 2 >> >> that is, in each row, we know the current web page that user is viewing, >> and we know the previous web page the user coming from >> >> now we want to generate a list for each user that recorded the complete >> path the user is taking: >> i.e., how can we use hive to generate output like: >> >> ------------------------------------------------------------------------------------------------ >> user 1 : page 1 page 2 page 3 page 4 .......... (till reach the >> beginning page of user 1) >> user 2: page 1 page 2 page 3 page 4 page 5 ....... ( till reach >> the beginning page of user 2) >> the web pages viewed by user 1 and user 2 might be different. >> >> can we generate this using hive? >> >> thanks,
-
Re: need help on writing hive query
Matt Tucker 2012-10-31, 21:53
I did a similar query a few months ago. In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array(). There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order.
If there's an easier way, please let me know.
Matt Tucker
On Oct 31, 2012, at 5:37 PM, Tom Brown <[EMAIL PROTECTED]> wrote:
> It wouldn't retrieve the user's path in a single string, but you could > simply select the user id and current page, ordered by the timestamp. > > It would require a second step to turn it into the single string path, > so that might be a deal-breaker. > > --Tom > > On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans > <[EMAIL PROTECTED]> wrote: >> You could use collect_set() and GROUP BY. That wouldn't preserve order >> though. >> >> Phil. >> >> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <[EMAIL PROTECTED]> wrote: >>> >>> Hi all, >>> >>> here is the question. Assume we have a table like: >>> >>> ------------------------------------------------------------------------------------------------------------------------------ >>> user_id || user_visiting_time || user_current_web_page || >>> user_previous_web_page >>> user 1 time (1,1) page 1 >>> page 0 >>> user 1 time (1,2) page 2 >>> page 1 >>> user 1 time (1,3 ) page 3 >>> page 2 >>> ..... ...... >>> .... .... >>> user n time (n,1) page 1 >>> page 0 >>> user n time (n,2) page 2 >>> page 1 >>> user n time (n,3) page 3 >>> page 2 >>> >>> that is, in each row, we know the current web page that user is viewing, >>> and we know the previous web page the user coming from >>> >>> now we want to generate a list for each user that recorded the complete >>> path the user is taking: >>> i.e., how can we use hive to generate output like: >>> >>> ------------------------------------------------------------------------------------------------ >>> user 1 : page 1 page 2 page 3 page 4 .......... (till reach the >>> beginning page of user 1) >>> user 2: page 1 page 2 page 3 page 4 page 5 ....... ( till reach >>> the beginning page of user 2) >>> the web pages viewed by user 1 and user 2 might be different. >>> >>> can we generate this using hive? >>> >>> thanks,
-
RE: need help on writing hive query
java8964 java8964 2012-10-31, 23:08
If you don't need to join current_web_page and previous_web_page, assuming you can just trust the time stamp, as Phil points out, an custom UDF of collect_list() is the way to go. You need to implement collect_list() UDF by yourself, hive doesn't have one by default.But it should be straight forward. In fact, you can reuse the code of collect_set(), replace the internal set with a Java ArrayList, then select user_id, collect_list(user_current_web_page)from( select user_id, user_current_web_page order by user_id asc, user_visiting_time asc)agroup by user_id Yong > Subject: Re: need help on writing hive query > From: [EMAIL PROTECTED] > Date: Wed, 31 Oct 2012 17:53:06 -0400 > To: [EMAIL PROTECTED] > > I did a similar query a few months ago. In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array(). There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order. > > If there's an easier way, please let me know. > > Matt Tucker > > On Oct 31, 2012, at 5:37 PM, Tom Brown <[EMAIL PROTECTED]> wrote: > > > It wouldn't retrieve the user's path in a single string, but you could > > simply select the user id and current page, ordered by the timestamp. > > > > It would require a second step to turn it into the single string path, > > so that might be a deal-breaker. > > > > --Tom > > > > On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans > > <[EMAIL PROTECTED]> wrote: > >> You could use collect_set() and GROUP BY. That wouldn't preserve order > >> though. > >> > >> Phil. > >> > >> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <[EMAIL PROTECTED]> wrote: > >>> > >>> Hi all, > >>> > >>> here is the question. Assume we have a table like: > >>> > >>> ------------------------------------------------------------------------------------------------------------------------------ > >>> user_id || user_visiting_time || user_current_web_page || > >>> user_previous_web_page > >>> user 1 time (1,1) page 1 > >>> page 0 > >>> user 1 time (1,2) page 2 > >>> page 1 > >>> user 1 time (1,3 ) page 3 > >>> page 2 > >>> ..... ...... > >>> .... .... > >>> user n time (n,1) page 1 > >>> page 0 > >>> user n time (n,2) page 2 > >>> page 1 > >>> user n time (n,3) page 3 > >>> page 2 > >>> > >>> that is, in each row, we know the current web page that user is viewing, > >>> and we know the previous web page the user coming from > >>> > >>> now we want to generate a list for each user that recorded the complete > >>> path the user is taking: > >>> i.e., how can we use hive to generate output like: > >>> > >>> ------------------------------------------------------------------------------------------------ > >>> user 1 : page 1 page 2 page 3 page 4 .......... (till reach the > >>> beginning page of user 1) > >>> user 2: page 1 page 2 page 3 page 4 page 5 ....... ( till reach > >>> the beginning page of user 2) > >>> the web pages viewed by user 1 and user 2 might be different. > >>> > >>> can we generate this using hive? > >>> > >>> thanks,
-
need help on writing hive query
qiaoresearcher 2012-11-03, 03:35
The table format is something like:
user_id visiting_time visiting_web_page user1 time11 page_string_11 user1 time12 page_string_12 with keyword 'abc' user1 time13 page_string_13 user1 time14 page_string_14 with keyword 'cde' user1 time15 page_string_15 ... ... ..... user2 time21 page_string_21 user2 time22 page_string_22 user2 time23 page_string_23 with keyword 'abc' user2 time24 page_string_24 user2 time25 page_string_25 with keyword 'cde' .... ..... ....
how to obtain output like: user1 time14 - time12 user2 time25 - time23 ... .............
i.e, we need to first search keyword 'abc' in the visiting_web_page column to locate the first visiting_time, then search for the second visiting_time with keyword 'cde', and repeat the procedure for each user...
thanks in advance!
-
RE: need help on writing hive query
java8964 java8964 2012-11-03, 23:48
This is not a hive but a SQL question. You need to be more clear about your data, and try to think a way to solve your problem. Without the detail about your data, no easy way to answer your question. For example, just based on your example data you provide, does the 'abc' and 'cde' only happen once per user? Does the 'abc' timestamp is always before or after 'cde' timestamp? There is a lot of ways to archive your output, but you need to confirm your data. 1) Assume that each user will only have one 'abc' and 'cde', and if the 'abc' timestamp is always before the 'cde' timestamp, then it is very easy: select user_id, min(visting_time) as cde_time, max(visting_time) as abc_timefrom tablewhere visting_web_page contains 'abc' or visting_web_page contains 'cde'group by user_id
2) If the 'abc' and 'cde' visiting_time is randome, but each user will only have one 'abc' and 'cde', it is still not too hard, you just use the case() + max() from hive: select user_id, max( case when visting_web_page contains 'cde' then visting_time else null end ) as cde_time, max( case when visting_web_page contains 'abc' then visting_time else null end ) as abc_timefrom tablegroup by user_id The trick is simple, you use case function to return timestamp if it is the 'cde' row, else return null; then apply max on top of it. Since everything is larger than NULL in max, you will get the cde time stamp. 3) If each user could have more than one 'abc' or 'cde', then you need to decide which one you want it out. Yong Date: Fri, 2 Nov 2012 22:35:42 -0500 Subject: need help on writing hive query From: [EMAIL PROTECTED] To: [EMAIL PROTECTED]
The table format is something like: user_id á ávisiting_time á á ávisiting_web_pageuser1 á á á á átime11 á á á á á á page_string_11 user1 á á á á átime12 á á á á á á page_string_12 á á with keyword 'abc' user1 á á á á átime13 á á á á á á page_string_13 user1 á á á á átime14 á á á á á á page_string_14 á á with keyword 'cde' user1 á á á á átime15 á á á á á á page_string_15
á ... á á á á á á á á... á á á á á á á á á á á .....áuser2 á á á á átime21 á á á á á á page_string_21 user2 á á á á átime22 á á á á á á page_string_22 á ááuser2 á á á á átime23 á á á á á á page_string_23 á ááwith keyword 'abc' user2 á á á á átime24 á á á á á á page_string_24 á áá user2 á á á á átime25 á á á á á á page_string_25 á ááwith keyword 'cde'á á.... á á á á á á á..... á á á á á á á á á á á .... á á á á
how to obtain output like: áuser1 á átime14 - time12 ááuser2 á átime25 - time23á á... á á á á á ............. i.e, we need to first search keyword 'abc' in the visiting_web_page column to locate the first visiting_time, then search for the second visiting_time with keyword 'cde', and repeat the procedure for each user...
thanks in advance! á á á
|
|