-
reduce continuous sessions
Marco Cadetg 2012-08-30, 08:00
Hi there,
I have some user sessions that look something like the following:
id:chararray, start:long(unix timestamp), end:long(unix timestamp)
xxx,1,3
xxx,4,7
yyy,1,2
yyy,5,7
zzz,6,7
zzz,7,10
I would like to combine the rows that belong to a continuous session. In my example, the result should be the following:
xxx,1,7
yyy,1,2
yyy,5,7
zzz,6,10
I guess there is no way to do this directly in pig but rather by using a UDF. Can someone give me a pointer on how you would achieve this?
Thanks, -Marco
-
Re: reduce continuous sessions
Prashant Kommireddi 2012-08-30, 08:07
Seems like you are looking to group by "id" and get the MIN and MAX timestamp for each group?
-
Re: reduce continuous sessions
Marco Cadetg 2012-08-30, 11:41
Unfortunately it's not that simple.
A = LOAD 'comb.txt' USING PigStorage(',') AS (id:chararray,start:long,end:long);
B = FOREACH (GROUP A BY id) { GENERATE FLATTEN(group),MIN(A.start),MAX(A.end); }
dump B;

(xxx,1,7)
(yyy,1,7)
(zzz,6,10)
This is not what I want. I only want to merge rows / sessions if they are continuous, i.e. the end of one session is the start of the next. In my example that is:
xxx,1,3
xxx,4,7
This is continuous, as the end of the first row is the start (+1s) of the next row.
Unlike this one, where the end of the first row is NOT the start of the next row:
yyy,1,2
yyy,5,7
Therefore I have to keep track of sessions somehow.
Cheers,
-Marco
-
RE: reduce continuous sessions
Steve Bernstein 2012-08-30, 16:02
You might want to check out LinkedIn's DataFu contribution, particularly the "sessionize" UDF: http://sna-projects.com/datafu/javadoc/0.0.4/datafu/pig/sessions/Sessionize.html
_____________
Steve Bernstein
VP, Analytics
Rearden Commerce, Inc.
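For reference, the merge rule described in this thread (combine a row with the next row of the same id when the next start is no more than one second after the previous end) can be sketched in plain Python. This is only an illustration of the logic, not code from any poster and not the DataFu UDF. The function name `merge_sessions` and the `max_gap` parameter are made up for this sketch, and wiring it into Pig as a UDF is left out.

```python
from itertools import groupby
from operator import itemgetter


def merge_sessions(rows, max_gap=1):
    """Merge (id, start, end) rows whose sessions touch or follow within max_gap seconds.

    Hypothetical helper for illustration; max_gap=1 matches the "+1s" rule in the thread.
    """
    merged = []
    # Sort by id, then start, so rows of one id are adjacent and in time order.
    for session_id, group in groupby(sorted(rows), key=itemgetter(0)):
        current_start, current_end = None, None
        for _, start, end in group:
            if current_start is None:
                # First row of this id opens a new session.
                current_start, current_end = start, end
            elif start <= current_end + max_gap:
                # Continuous with the open session: extend its end.
                current_end = max(current_end, end)
            else:
                # Gap too large: close the open session and start a new one.
                merged.append((session_id, current_start, current_end))
                current_start, current_end = start, end
        merged.append((session_id, current_start, current_end))
    return merged


rows = [
    ("xxx", 1, 3), ("xxx", 4, 7),
    ("yyy", 1, 2), ("yyy", 5, 7),
    ("zzz", 6, 7), ("zzz", 7, 10),
]
for row in merge_sessions(rows):
    print(row)
```

On the sample data from the first message this yields xxx,1,7 / yyy,1,2 / yyy,5,7 / zzz,6,10. The same per-id loop is what a UDF would run over the sorted bag produced by a GROUP BY id.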