Steve Bernstein 2012-08-29, 23:27
Hi all, I have a bag, clickstreams: {clickStream: {pageName: chararray}}, in which each row represents a sequence of pages and events in a single session on a website. The interior bag, clickStream, represents this as a sequence of one or more single-element tuples, e.g.,
{(homepage),(pg1),(pg2),...,(pgN)}
I'd like to group by the sequences so I can get counts and ultimately sort to find the most common clickstreams. A bag can't be a key for grouping, I've discovered, but it seems like it ought to be easy to flatten the clickstream bag into some other form such that the sequences can be used as keys for grouping. But I can't figure it out.
Any ideas?
Thanks! Steve
-
RE: group by clickstream
Steve Bernstein 2012-08-30, 16:06
Some clarification on my original question. Ignore the outer bag; I'd removed some data elements for clarity and simplicity. Basically, I'm trying to find a way to go from:
{(pg),(pg),...,(pg)} to {(pg,pg,...,pg)}
For an arbitrary number of "pg" tuples.
SB
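One way to get a groupable key from {(pg),(pg),...,(pg)} is to join the page names into a single delimited string. Below is a minimal sketch in Python, written in the shape of a Pig Jython UDF, on the assumption that the bag arrives as a list of single-field tuples already in click order. The function name `bag_to_key`, the schema alias `clickKey`, and the `>` delimiter are illustrative choices, not anything from the thread; the `outputSchema` fallback exists only so the snippet also runs outside Pig.

```python
try:
    # Available when the script is registered as a Jython UDF in Pig.
    from pig_util import outputSchema
except ImportError:
    # Fallback (assumption for illustration) so the sketch runs in plain Python.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("clickKey:chararray")
def bag_to_key(bag, sep=">"):
    """Join a bag of single-field tuples into one delimited string."""
    if bag is None:
        return None
    # Each tuple holds exactly one field: the page name.
    return sep.join(str(t[0]) for t in bag)
```

The resulting chararray is a scalar, so it could serve as a GROUP key where the bag itself cannot.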
-
RE: group by clickstream
Steve Bernstein 2012-08-30, 16:22
Nvm, here's what I'll do, but if anyone has a better idea, please do tell.
I'll STORE the bag using PigStorage(';') to delimit the chararrays, then reload it with an appropriate schema that treats each page sequence as a single concatenated string, and group and count by those strings. I can SPLIT out the outer bag beforehand to keep the inner bag's sibling data elements isolated.
SB
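The group-and-count step on concatenated strings can be illustrated outside Pig. This is a rough Python sketch of the idea only, not the Pig script itself; the sample sessions and the name `top_clickstreams` are made up for illustration, and the `;` separator mirrors the delimiter mentioned above.

```python
from collections import Counter


def top_clickstreams(sessions, sep=";"):
    """Count identical page sequences and return them most common first."""
    # Turn each sequence into one string so it can act as a grouping key.
    keys = (sep.join(pages) for pages in sessions)
    return Counter(keys).most_common()


# Hypothetical sample data: each inner list is one session's page sequence.
sessions = [
    ["homepage", "pg1", "pg2"],
    ["homepage", "pg1"],
    ["homepage", "pg1", "pg2"],
]
ranked = top_clickstreams(sessions)
```

One caveat with any string key: the separator should be a character that cannot appear inside a page name, otherwise distinct sequences could collide.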
-
Re: group by clickstream
Vitalii Tymchyshyn 2012-08-31, 20:48
Hello.
Doesn't FLATTEN do exactly this?
Best regards, Vitalii Tymchyshyn
-
RE: group by clickstream
Steve Bernstein 2012-08-31, 23:14
Nope, tried that, it breaks it back into one tuple per record...not what I want.
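The behavior described here can be mimicked in Python: flattening a bag emits one output record per inner tuple, rather than one record holding the whole sequence. This is only a sketch of that semantics; the `session_id` field and the function name `flatten_bag` are hypothetical and stand in for whatever sibling fields accompany the bag.

```python
def flatten_bag(records):
    """Mimic FLATTEN on a bag field: one output row per tuple in the bag."""
    rows = []
    for session_id, bag in records:
        for t in bag:
            # The sibling field is repeated alongside each tuple's contents.
            rows.append((session_id,) + tuple(t))
    return rows


# Hypothetical input: one session whose bag holds two single-field tuples.
rows = flatten_bag([("s1", [("homepage",), ("pg1",)])])
```

Here `rows` contains two records, one per page, which is why flattening alone does not produce a single key for the whole clickstream.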