Grig Gheorghiu 2012-01-27, 00:02
Let's say I have this dataset:
1,undefined,text1
1,,text2
1,event1,text3
1,undefined,text4
1,event2,text5
1,event3,text6
I would like to group by 1st value, but not quite an ordinary grouping. I would like all lines that contain either an empty value or 'undefined' on the 2nd position to be rolled up in the first line that contains a proper value in the 2nd position. So basically I'd like to obtain this relation:
(1,event1,3)
(1,event2,2)
(1,event3,1)
(where the 3rd value is the count of lines that were seen before a proper 'event' line was seen).
Is this possible with Pig?
Thanks!
Grig
-
Re: Non-standard grouping
Prashant Kommireddi 2012-01-27, 00:06
What is the last field in your output?
(1,event1,3)
(1,event2,2)
(1,event3,1)
-
Re: Non-standard grouping
Grig Gheorghiu 2012-01-27, 00:08
The count of lines seen up to and including a proper event value (3 lines for event1, 2 for event2, 1 for event3).
-
Re: Non-standard grouping
Prashant Kommireddi 2012-01-27, 00:24
Grig, I am afraid there is nothing built into Pig to do this.
-
Re: Non-standard grouping
Grig Gheorghiu 2012-01-27, 00:32
Could you even do it with a UDF? In a regular programming language you can easily do it with a sentinel that you keep track of, but in Pig I can't figure it out...
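For reference, a minimal sketch of the sentinel approach mentioned above, in plain Python. It assumes the rows arrive in the original order as (key, event, text) tuples, and it treats both the empty string and 'undefined' as placeholders. The function name roll_up is made up for illustration and is not part of Pig.

```python
def roll_up(rows):
    """Yield (key, event, count) for each row that carries a proper event.

    count is the number of rows seen for that key since the previous
    proper event, including the event row itself.
    """
    placeholders = {"", "undefined"}  # assumption: these mean "no event"
    pending = {}  # key -> rows seen since the last proper event
    for key, event, _text in rows:
        pending[key] = pending.get(key, 0) + 1
        if event not in placeholders:
            yield (key, event, pending[key])
            pending[key] = 0  # reset the running count for this key


rows = [
    ("1", "undefined", "text1"),
    ("1", "", "text2"),
    ("1", "event1", "text3"),
    ("1", "undefined", "text4"),
    ("1", "event2", "text5"),
    ("1", "event3", "text6"),
]
result = list(roll_up(rows))
print(result)
```

On the sample dataset this gives the three tuples asked for in the first message. Placeholder rows that come after the last proper event for a key are silently dropped, which may or may not be what is wanted.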
-
Re: Non-standard grouping
Dmitriy Ryaboy 2012-02-02, 23:43
"Records before" is kind of hard to define in an MR paradigm. I suppose you could group and then run the records through an accumulative UDF. But this is feeling very hacky. Is there a more scalable (order-independent) way you can do what you need?
-
Re: Non-standard grouping
Grig Gheorghiu 2012-02-02, 23:45
Hey Dmitriy! Unfortunately that's the requirement. The solution I found so far is to do all the pre-filtering and grouping I can in Pig, and then run Python on the output file generated by Pig. That file is ~300 MB, so it's not a problem to just run it through Python.
Thanks for getting back to me.
Grig
-
Re: Non-standard grouping
Dmitriy Ryaboy 2012-02-03, 03:35
Ah, yeah, if you can shrink data down that much, going outside of Pig (or doing things in a UDF) is the way to go.
D
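To illustrate the accumulative-UDF idea raised earlier in the thread, and why the result depends on record order, here is a rough Python sketch. The class only mimics the accumulate / get value / cleanup shape of an accumulator fed in batches. The names RollUpAccumulator, run and batch_size are hypothetical, and none of this is Pig's actual API.

```python
class RollUpAccumulator:
    """Hypothetical stand-in for an accumulative UDF (not Pig's API)."""

    PLACEHOLDERS = {"", "undefined"}  # assumption: these mean "no event"

    def __init__(self):
        self.pending = 0   # rows seen since the last proper event
        self.results = []  # (event, count) pairs emitted so far

    def accumulate(self, batch):
        # State survives between calls, so a run of placeholder rows
        # may span several batches.
        for event in batch:
            self.pending += 1
            if event not in self.PLACEHOLDERS:
                self.results.append((event, self.pending))
                self.pending = 0

    def get_value(self):
        return list(self.results)

    def cleanup(self):
        # Reset before the next group is processed.
        self.pending = 0
        self.results = []


def run(events, batch_size=2):
    """Feed one group's events to the accumulator in fixed-size batches."""
    acc = RollUpAccumulator()
    for i in range(0, len(events), batch_size):
        acc.accumulate(events[i:i + batch_size])
    return acc.get_value()


in_order = run(["undefined", "", "event1", "undefined", "event2", "event3"])
reordered = run(["event1", "event2", "event3", "undefined", "", "undefined"])
print(in_order)
print(reordered)
```

The same six records produce different counts once they are reordered, which is the order-dependence concern: the records in a group would need an explicit, reliable ordering (for example a timestamp or line number to sort on) before logic like this could be trusted.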