Re: Problem with dereferencing and alias
Hi,

The point is that visit is a tuple nested inside each tuple of your
original relation, so its fields are not top-level fields of data2.
Describing data2 with and without FLATTEN should make this clear:

WITH FLATTEN
grunt> data2 = foreach data generate FLATTEN(visit);
grunt> describe data2
data2: {visit::visitorid: bytearray,visit::visitid: bytearray,visit::browser: bytearray}

WITHOUT FLATTEN
grunt> data2 = foreach data generate visit;
grunt> describe data2
data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}
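
A note on the visit:: prefix in the flattened schema: Pig adds it to record
which tuple each field came from. As far as I know, you can use either the
short name or the prefixed one as long as the short name is unambiguous,
which is why your "group data2 by browser" worked. A small sketch reusing
the aliases from your script (data3b is just a name I made up for the
comparison):

```pig
data2 = foreach data generate FLATTEN(visit);

-- Short name: fine while no other field in data2 is called browser.
data3 = group data2 by browser;

-- Fully qualified name: refers to the same field.
data3b = group data2 by visit::browser;
```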

If you don't want to flatten (for whichever reason), you need to modify
your script like this:

data3 = group data2 by visit.browser;

But then you have a double nesting, which I find cumbersome to work with:
grunt> describe data3
data3: {group: bytearray,data2: {(visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray))}}

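Now data3 is a bag whose tuples each hold a nested bag data2. Each tuple
in that bag has a single field, visit, which is itself a 3-element tuple.
So data2.visitorid finds nothing: visitorid sits one level further down,
inside visit.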

That's why flattening comes in handy in this case.
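
If you would rather not use FLATTEN, a middle ground is to dereference the
fields you need while projecting, so that data2 has no nested tuple to
begin with. This is only a sketch that I have not run against your data;
the aliases are the ones from your script:

```pig
-- Pull the two fields out of the visit tuple and give them top-level names.
data2 = foreach data generate visit.visitorid as visitorid, visit.browser as browser;
data3 = group data2 by browser;
dc = foreach data3 {
    d1 = data2.visitorid;   -- bag of visitorid values for this browser
    d2 = distinct d1;
    generate group, COUNT(d2), COUNT(d1);
};
```

Since each input row carries exactly one visit, COUNT(d1) should match the
number of visits per browser, provided visitorid is never null.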

I hope it helps.

Cheers,
--
Gianmarco

On Mon, Apr 23, 2012 at 21:05, Mustafi, Priyo <[EMAIL PROTECTED]> wrote:

> Hi All,
> I am pretty new to pig and am having some issues with dereferencing. My
> data in simplified form looks like below
>
> data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid,
> visitid, browser), events:bag{event:tuple(pagename, pagevar)});
>
> cat visitevent   (note there is tab in between the visit and the events)
> (vr1,vi1,ff)    {((pagea,eb1)),((pageb,eb3))}
> (vr1,vi2,ff)    {((pageb,eb2))}
> (vr2,vi3,ff)    {((pageb,eb4))}
> (vr3,vi4,ie)    {((pagec,eb3)),((pagea,eb5))}
>
>
> My task is the following
> 1)  Generate count(visitid) and count(distinct visitorid) by browser
> 2)  Generate count(events), count(visitid) and count(distinct visitorid)
> by pagename
>
>
> I have issues with the first task.  I tried the below after flattening
> visit and it worked.
>
> data = load 'c:/shared/visitevent' using PigStorage() AS
> (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename,
> pagevar)});
> data2 = foreach data generate FLATTEN(visit);
> data3 = group data2 by browser;
> dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
> describe dc;
> dump dc;
>
>
> I don't understand why I would need to flatten visit.  I tried the below
> without flattening and whatever I try it doesn't work. Not sure why.
>
> data = load 'c:/shared/visitevent' using PigStorage() AS
> (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename,
> pagevar)});
> data2 = foreach data generate visit;
> data3 = group data2 by browser;
> --  describe data3  produces below
> --       data3: {group: bytearray,data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}}
> --  none of the below work as somehow it doesn't find the alias.  Why?
> dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
> dc = foreach data3 {d1 = visit.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
>
> What am I doing wrong?  Since my task #2 is going to group by pagename
> which is in a bag->tuple, do I have to flatten that one twice to get this
> working? Is there any documentation on dereferencing complex and nested
> structures?  Any help appreciated.
>
> Thanks
> Priyo
>
>
>
>