

RE: reduce continuous sessions
You might want to check out LinkedIn's DataFu contribution, particularly the "sessionize" UDF:
http://sna-projects.com/datafu/javadoc/0.0.4/datafu/pig/sessions/Sessionize.html
_____________
Steve Bernstein
VP, Analytics
Rearden Commerce, Inc.

+1.408.499.0961 Mobile

deem.com | reardencommerce.com

-----Original Message-----
From: Marco Cadetg [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 30, 2012 4:42 AM
To: [EMAIL PROTECTED]
Subject: Re: reduce continuous sessions

Unfortunately it's not that simple.

A = LOAD 'comb.txt' USING PigStorage(',') AS (id:chararray,start:long,end:long);
B = FOREACH (GROUP A BY id) GENERATE FLATTEN(group), MIN(A.start), MAX(A.end);
DUMP B;
(xxx,1,7)
(yyy,1,7)
(zzz,6,10)

This is not what I want. I only want to merge rows (sessions) that are continuous, i.e. where the end of one session is the start of the next. In my example that is:
xxx,1,3
xxx,4,7

This is continuous as the end of the first row is the start (+1s) of the next row.

In contrast, here the end of the first row is NOT the start of the next row:
yyy,1,2
yyy,5,7

Therefore I have to keep track of sessions somehow.

Cheers,
-Marco
On Thu, Aug 30, 2012 at 10:07 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> Seems like you are looking to group by "id" and get the MIN and MAX
> timestamp for each group?
>
>
> On Thu, Aug 30, 2012 at 1:00 AM, Marco Cadetg <[EMAIL PROTECTED]> wrote:
>
> > Hi there,
> >
> > I have some user sessions which look something along the following lines:
> >
> > id:chararray, start:long(unix timestamp), end:long(unix timestamp)
> > xxx,1,3
> > xxx,4,7
> > yyy,1,2
> > yyy,5,7
> > zzz,6,7
> > zzz,7,10
> >
> > I would like to combine the rows which belong to a continuous
> > session, e.g. in my example the result should be the following:
> > xxx,1,7
> > yyy,1,2
> > yyy,5,7
> > zzz,6,10
> >
> > I guess there is no way to do this directly in pig but rather by
> > using a UDF. Can someone give me a pointer on how you would achieve this?
> >
> > Thanks,
> > -Marco
> >
>
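
For reference, the merge rule described in this thread (combine two rows for the same id when the next start is at most one second after the previous end) can be sketched in plain Python. This is only an illustration of the logic, not the DataFu Sessionize UDF and not a ready-made Pig UDF. The names merge_sessions and max_gap are hypothetical, and the one-second tolerance is an assumption read off the xxx and zzz examples.

```python
def merge_sessions(rows, max_gap=1):
    """Merge continuous sessions per id.

    rows: iterable of (id, start, end) tuples.
    max_gap: assumed tolerance; a row is merged into the previous one
             when its start is at most max_gap after the previous end.
    """
    merged = []
    # Sorting by (id, start, end) puts each id's sessions in time order.
    for sid, start, end in sorted(rows):
        if merged and merged[-1][0] == sid and start - merged[-1][2] <= max_gap:
            # Continuous with the previous session: extend its end.
            prev_id, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_id, prev_start, max(prev_end, end))
        else:
            # New id, or a gap larger than max_gap: start a new session.
            merged.append((sid, start, end))
    return merged


rows = [
    ("xxx", 1, 3), ("xxx", 4, 7),
    ("yyy", 1, 2), ("yyy", 5, 7),
    ("zzz", 6, 7), ("zzz", 7, 10),
]
for row in merge_sessions(rows):
    print(",".join(str(field) for field in row))
```

On the sample rows from the thread this prints xxx,1,7 / yyy,1,2 / yyy,5,7 / zzz,6,10. Inside Pig, the same per-id pass would have to live in a UDF applied to each group's bag sorted by start, since it needs to remember the previous row.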