-
Problem with dereferencing and alias
Mustafi, Priyo 2012-04-23, 19:05
Hi All, I am pretty new to Pig and am having some issues with dereferencing. My data, in simplified form, looks like this:
data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});
cat visitevent (note there is a tab between the visit and the events)

    (vr1,vi1,ff) {((pagea,eb1)),((pageb,eb3))}
    (vr1,vi2,ff) {((pageb,eb2))}
    (vr2,vi3,ff) {((pageb,eb4))}
    (vr3,vi4,ie) {((pagec,eb3)),((pagea,eb5))}

My task is the following:
1) Generate count(visitid) and count(distinct visitorid) by browser
2) Generate count(events), count(visitid) and count(distinct visitorid) by pagename

I have issues with the first task. I tried the below after flattening visit and it worked.
    data = load 'c:/shared/visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});
    data2 = foreach data generate FLATTEN(visit);
    data3 = group data2 by browser;
    dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
    describe dc;
    dump dc;

I don't understand why I would need to flatten visit. I tried the below without flattening and whatever I try it doesn't work. Not sure why.
    data = load 'c:/shared/visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});
    data2 = foreach data generate visit;
    data3 = group data2 by browser;
    -- describe data3 produces below
    -- data3: {group: bytearray,data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}}
    -- none of the below work as somehow it doesn't find the alias. Why?
    dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
    dc = foreach data3 {d1 = visit.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
What am I doing wrong? Since my task #2 is going to group by pagename, which is in a bag->tuple, do I have to flatten that one twice to get this working? Is there any documentation on dereferencing complex and nested structures? Any help appreciated.

Thanks
Priyo
-
Re: Problem with dereferencing and alias
Gianmarco De Francisci Mo... 2012-04-23, 20:10
Hi,
the fact is that visit is a nested tuple inside the tuples that make up your original relation. If you describe the data2 relation, it should become clear:

WITH FLATTEN
    grunt> describe data2
    data2: {visit::visitorid: bytearray,visit::visitid: bytearray,visit::browser: bytearray}

WITHOUT FLATTEN
    grunt> data2 = foreach data generate visit;
    grunt> describe data2
    data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}
If you don't want to flatten (for whatever reason), you need to modify the grouping statement in your script like this:
data3 = group data2 by visit.browser;
But then you have a double nesting, which I find cumbersome to work with:

    grunt> describe data3
    data3: {group: bytearray,data2: {(visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray))}}
Now data3 is a bag whose tuples each hold a nested bag, data2; every tuple in data2 holds a single field, visit, which is itself a 3-element tuple.
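For illustration, the first task could be finished without flattening roughly as follows. This is an untested sketch rather than a confirmed solution: it assumes Pig 0.10 or later, where a FOREACH is allowed inside a nested block, and it reuses the aliases from the original script.

```pig
-- Sketch only: assumes Pig 0.10+ (FOREACH inside a nested block).
data  = load 'visitevent' using PigStorage()
        AS (visit:tuple(visitorid, visitid, browser),
            events:bag{event:tuple(pagename, pagevar)});
data2 = foreach data generate visit;

-- browser lives inside the visit tuple, so it has to be dereferenced
data3 = group data2 by visit.browser;

dc = foreach data3 {
    -- pull visitorid out of the nested visit tuple, one row per visit
    d1 = foreach data2 generate visit.visitorid;
    d2 = distinct d1;
    -- browser, distinct visitors, visits
    generate group, COUNT(d2), COUNT(d1);
};
dump dc;
```

The inner FOREACH is needed because data2.visitorid looks for a field named visitorid directly in the tuples of data2, and those tuples only have the single field visit.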
That's why flattening comes in handy in this case.
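The same idea might carry over to the second task (grouping by pagename). The sketch below is untested and assumes that events really matches the declared schema, a bag of (pagename, pagevar) tuples; the doubled parentheses in the sample file suggest one more level of nesting, in which case an additional FLATTEN may be required. The aliases flat, bypage and pc are made up for this example.

```pig
-- Sketch only: assumes events is a bag of (pagename, pagevar) tuples,
-- exactly as declared in the AS clause.
data = load 'visitevent' using PigStorage()
       AS (visit:tuple(visitorid, visitid, browser),
           events:bag{event:tuple(pagename, pagevar)});

-- Flattening the tuple and the bag in one GENERATE yields
-- one row per (visit, event) pair.
flat = foreach data generate
           FLATTEN(visit)  AS (visitorid, visitid, browser),
           FLATTEN(events) AS (pagename, pagevar);

bypage = group flat by pagename;

pc = foreach bypage {
    v1 = flat.visitorid;
    v2 = distinct v1;
    v3 = flat.visitid;
    -- pagename, events, visitid values, distinct visitors
    generate group, COUNT(flat), COUNT(v3), COUNT(v2);
};
dump pc;
```

After flattening, every row is one event, so COUNT(v3) counts visitid values per event rather than distinct visits; applying DISTINCT to v3 first would count each visit once per page.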
I hope it helps.
Cheers, -- Gianmarco
-
RE: Problem with dereferencing and alias
Mustafi, Priyo 2012-04-23, 20:50
Thanks Gianmarco! I see why it makes sense now. I guess when I see multiple levels of nesting, I should flatten for ease of processing.