Pig >> mail # user >> Problem with dereferencing and alias


Re: Problem with dereferencing and alias
Hi,

The issue is that visit is a tuple nested inside the tuples that make up
your original relation.
If you describe the data2 relation, it should become clear:

WITH FLATTEN
grunt> describe data2
data2: {visit::visitorid: bytearray,visit::visitid: bytearray,visit::browser: bytearray}

WITHOUT FLATTEN
grunt> data2 = foreach data generate visit;
grunt> describe data2
data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}
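
As a minimal sketch of what the first schema means in practice: after FLATTEN
the fields carry the visit:: prefix, but as long as the short name is
unambiguous either form can be used (the original "group data2 by browser"
relied on this). The alias names are the ones from your script; the AS clause
at the end is an optional addition for illustration, not something your
script needs.

```pig
data2 = foreach data generate FLATTEN(visit);

-- either of these refers to the same field; use one or the other
data3 = group data2 by visit::browser;
data3 = group data2 by browser;

-- optional: rename while flattening so the visit:: prefix disappears
data2 = foreach data generate FLATTEN(visit) AS (visitorid, visitid, browser);
```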

If you don't want to flatten (for whatever reason), you need to modify
your script like this:

data3 = group data2 by visit.browser;

But then you have a double nesting, which I find cumbersome to work with:
grunt> describe data3
data3: {group: bytearray,data2: {(visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray))}}

Now data3 is a bag of tuples, each holding the group key and a nested bag
data2; every tuple in that bag has a single field, visit, which is itself
a 3-element tuple.

That's why flattening comes in handy in this case.
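
If the goal is only to avoid FLATTEN, another option is to dereference the
tuple fields with "." in the foreach, so that data2 ends up with plain
top-level fields and the rest of your script works unchanged. This is a
sketch, assuming the same load statement and sample file as in your mail;
I have not run it against your data.

```pig
data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});

-- pull the three fields out of the visit tuple by dereferencing with '.'
data2 = foreach data generate visit.visitorid AS visitorid, visit.visitid AS visitid, visit.browser AS browser;

data3 = group data2 by browser;
dc = foreach data3 {
    d1 = data2.visitorid;   -- bag of visitorid values for this browser
    d2 = distinct d1;
    generate group, COUNT(d2), COUNT(d1);
};
dump dc;
```

The result is the same shape you get from FLATTEN, just without the visit::
prefix on the field names.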

I hope it helps.

Cheers,
--
Gianmarco

On Mon, Apr 23, 2012 at 21:05, Mustafi, Priyo <[EMAIL PROTECTED]> wrote:

> Hi All,
> I am pretty new to Pig and am having some issues with dereferencing. My
> data, in simplified form, looks like this:
>
> data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid,
> visitid, browser), events:bag{event:tuple(pagename, pagevar)});
>
> cat visitevent   (note there is a tab between the visit and the events)
> (vr1,vi1,ff)    {((pagea,eb1)),((pageb,eb3))}
> (vr1,vi2,ff)    {((pageb,eb2))}
> (vr2,vi3,ff)    {((pageb,eb4))}
> (vr3,vi4,ie)    {((pagec,eb3)),((pagea,eb5))}
>
>
> My task is the following
> 1)  Generate count(visitid) and count(distinct visitorid) by browser
> 2)  Generate count(events), count(visitid) and count(distinct visitorid)
> by pagename
>
>
> I have issues with the first task.  I tried the below after flattening
> visit and it worked.
>
> data = load 'c:/shared/visitevent' using PigStorage() AS
> (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename,
> pagevar)});
> data2 = foreach data generate FLATTEN(visit);
> data3 = group data2 by browser;
> dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
> describe dc;
> dump dc;
>
>
> I don't understand why I would need to flatten visit.  I tried the below
> without flattening and whatever I try it doesn't work. Not sure why.
>
> data = load 'c:/shared/visitevent' using PigStorage() AS
> (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename,
> pagevar)});
> data2 = foreach data generate visit;
> data3 = group data2 by browser;
> -- describe data3 produces the following:
> --     data3: {group: bytearray,data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}}
> -- neither of the statements below works; somehow it doesn't find the alias. Why?
> dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
> dc = foreach data3 {d1 = visit.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
>
> What am I doing wrong?  Since my task #2 is going to group by pagename,
> which is in a bag->tuple, do I have to flatten that one twice to get this
> working? Is there any documentation on dereferencing complex and nested
> structures?  Any help appreciated.
>
> Thanks
> Priyo
>
>
>
>