Pig, mail # user - reduce continuous sessions


Marco Cadetg 2012-08-30, 08:00
Prashant Kommireddi 2012-08-30, 08:07
Marco Cadetg 2012-08-30, 11:41
RE: reduce continuous sessions
Steve Bernstein 2012-08-30, 16:02
You might want to check out LinkedIn's DataFu contribution, particularly the "sessionize" UDF:
http://sna-projects.com/datafu/javadoc/0.0.4/datafu/pig/sessions/Sessionize.html
_____________
Steve Bernstein
VP, Analytics
Rearden Commerce, Inc.

+1.408.499.0961 Mobile

deem.com | reardencommerce.com

-----Original Message-----
From: Marco Cadetg [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 30, 2012 4:42 AM
To: [EMAIL PROTECTED]
Subject: Re: reduce continuous sessions

Unfortunately it's not that simple.

A = LOAD 'comb.txt' USING PigStorage(',') AS (id:chararray, start:long, end:long);
B = FOREACH (GROUP A BY id) GENERATE FLATTEN(group), MIN(A.start), MAX(A.end);
DUMP B;
(xxx,1,7)
(yyy,1,7)
(zzz,6,10)

This is not what I want. I only want to merge rows (sessions) if they are continuous, i.e. the end of one session is the start of the next. In my example that is:
xxx,1,3
xxx,4,7

This is continuous as the end of the first row is the start (+1s) of the next row.

Unlike this one, here the end of the first row is NOT the start of the next row...
yyy,1,2
yyy,5,7

Therefore I have to keep track of sessions somehow.
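
The bookkeeping described above could be sketched roughly as follows. This is only a minimal illustration in plain Python of the logic a UDF might wrap, not a tested Pig UDF: the function name `merge_continuous` and the `max_gap` parameter are made up for this example, and the default gap of 1 second is an assumption taken from the "(+1s)" remark above. It sorts the rows by id and start, then extends the current session while the next row starts no later than `max_gap` after the current end.

```python
from itertools import groupby
from operator import itemgetter


def merge_continuous(rows, max_gap=1):
    """Merge (id, start, end) rows of the same id whose sessions are continuous.

    A row continues the current session if its start is at most `max_gap`
    seconds after the current end (hypothetical parameter, assumed to be 1).
    """
    merged = []
    # Sort by id, then start, so candidate rows for merging are adjacent.
    for key, group in groupby(sorted(rows), key=itemgetter(0)):
        current_start = current_end = None
        for _, start, end in group:
            if current_start is None:
                # First row of this id opens a session.
                current_start, current_end = start, end
            elif start <= current_end + max_gap:
                # Continuous: extend the open session.
                current_end = max(current_end, end)
            else:
                # Gap found: close the open session and start a new one.
                merged.append((key, current_start, current_end))
                current_start, current_end = start, end
        merged.append((key, current_start, current_end))
    return merged


rows = [
    ("xxx", 1, 3), ("xxx", 4, 7),
    ("yyy", 1, 2), ("yyy", 5, 7),
    ("zzz", 6, 7), ("zzz", 7, 10),
]
result = merge_continuous(rows)
for row in result:
    print(",".join(str(field) for field in row))
```

On the sample data from this thread, the sketch keeps the two yyy rows apart and merges the xxx and zzz rows. In Pig, the same loop would presumably run per group over a bag ordered by start.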

Cheers,
-Marco
On Thu, Aug 30, 2012 at 10:07 AM, Prashant Kommireddi
<[EMAIL PROTECTED]> wrote:

> Seems like you are looking to group by "id" and get the MIN and MAX
> timestamp for each group?
>
>
> On Thu, Aug 30, 2012 at 1:00 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
>
> > Hi there,
> >
> > I have some user sessions that look something like the following:
> >
> > id:chararray, start:long(unix timestamp), end:long(unix timestamp)
> > xxx,1,3
> > xxx,4,7
> > yyy,1,2
> > yyy,5,7
> > zzz,6,7
> > zzz,7,10
> >
> > I would like to combine the rows which belong to a continuous
> > session, e.g. in my example the result should be the following:
> > xxx,1,7
> > yyy,1,2
> > yyy,5,7
> > zzz,6,10
> >
> > I guess there is no way to do this directly in pig but rather by
> > using a UDF. Can someone give me a pointer on how you would achieve this?
> >
> > Thanks,
> > -Marco
> >
>