-
Cumulative totals in an ORDERed relation.
Kris Coward 2010-12-17, 19:31
Hello,

Is there some sort of mechanism by which I could cause a value to accumulate within a relation? What I'd like to do is something along the lines of having a long called accumulator, and an outer bag called hourlyTotals with a schema of (hour:int, collected:int)

    accumulator = 0L; -- I know this line doesn't work
    ORDER hourlyTotals BY collected;
    cumulativeTotals = FOREACH hourlyTotals {
        accumulator += collected;
        GENERATE day, accumulator AS collected;
    }

Could something like this be made to work? Is there something similar that I can do instead? Do I just need to pipe the relation through an external script to get what I want?

Thanks,
Kris

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
-
Re: Cumulative totals in an ORDERed relation.
Dmitriy Ryaboy 2010-12-17, 23:22
What you are suggesting seems to be a fundamentally single-threaded process (well, it can be parallelized, but it's not pretty and involves multiple passes), so it's not a good fit for the map-reduce paradigm (how would you do accumulative totals for 25 billion entries?). Pig tends to avoid implementing methods that restrict scaling computations in this way.

Your idea of streaming through a script would work; you could also write an accumulative UDF and use it on the result of doing a GROUP ALL on your relation.

-Dmitriy
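As a rough illustration of the "stream it through a script" option, here is a minimal Python sketch. It assumes the relation has already been ordered, that every row passes through a single instance of the script, and that rows arrive as tab-separated `hour<TAB>collected` lines. The function name `running_totals` and that line layout are assumptions made for illustration, not something specified in the thread.

```python
import sys
from typing import Iterable, Iterator


def running_totals(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'hour<TAB>cumulative' for each 'hour<TAB>collected' input line.

    Assumes the lines are already in the desired order; the running sum
    simply follows whatever order the input arrives in.
    """
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        hour, collected = line.split("\t")
        total += int(collected)
        yield f"{hour}\t{total}"


# When used as a stand-alone streaming script, the entry point would be:
#     for out in running_totals(sys.stdin):
#         print(out)
```

On the Pig side this would presumably be hooked up with the STREAM ... THROUGH operator after the ORDER; whether the ordering survives depends on the whole relation reaching one process, so that is worth checking on real data.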
-
Re: Cumulative totals in an ORDERed relation.
Zach Bailey 2010-12-17, 23:32
I believe what you're trying to do is this. You have some sort of data, and a timestamp:

What you want to figure out is how many times each possible value of "data" appears in a certain time period (say, hourly).

Let's say data can have three possible string values: {'a', 'b', 'c'}

Your timestamp, for convenience's sake, is a Unix UTC timestamp or ISO formatted date (I would strongly recommend using one of these since there are already piggybank functions to slice and dice them).

To accumulate all the times that the data 'a' appeared in an hour you would do something like this:

    --register piggybank.jar for iso date functions
    REGISTER ./piggybank.jar;
    allData = load ... as (string:chararray, ts:long);
    --convert ts to ISO Date, and truncate to the hour
    allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;
    -- group by hour and string
    groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
    -- append counts
    stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;

You will now have a relation that looks like:

    {'a', '2010-12-13T12:00:00', 2334}
    {'b', '2010-12-13T12:00:00', 123}
    {'c', '2010-12-13T12:00:00', 3}
    {'a', '2010-12-13T13:00:00', 34231}
    {'b', '2010-12-13T13:00:00', 34}
    {'c', '2010-12-13T13:00:00', 134}

Is that the sort of thing you're looking to do?

-Zach
-
Re: Cumulative totals in an ORDERed relation.
Zach Bailey 2010-12-17, 23:36
Forgive me but I got one thing slightly wrong. Since you're wanting to do hourly totals and not daily totals you will want to change this line:

    allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;

to this:

    allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;

Of course I just illustrated how easy it is to swap in different piggybank functions to do different statistical roll-ups depending on what sort of temporal granularity you need. Huzzah!

Happy pigging,
Zach
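For readers who want to sanity-check the roll-up outside Pig, the same "truncate to the hour, then count per (value, hour)" idea can be sketched in plain Python. This is only an analogue of the Pig script, not a description of how the piggybank functions behave: it assumes timestamps are Unix seconds in UTC, and `count_by_value_and_hour` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Tuple


def count_by_value_and_hour(rows: Iterable[Tuple[str, int]]) -> Counter:
    """Count occurrences of each (value, hour) pair.

    rows: (value, unix_seconds) pairs; seconds-since-epoch is an assumption.
    """
    counts: Counter = Counter()
    for value, ts in rows:
        # Truncate the timestamp to the start of its hour (UTC).
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(value, hour.strftime("%Y-%m-%dT%H:%M:%S"))] += 1
    return counts
```

Changing which fields are zeroed in `replace(...)` (for example also setting `hour=0`) gives the daily variant, which mirrors swapping ISOToHour for ISOToDay.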
-
Re: Cumulative totals in an ORDERed relation.
Dmitriy Ryaboy 2010-12-18, 00:21
My interpretation was that he wants something more like this:
in:  {2, 5, 7, 1, 1, 3}
out: {2, 7, 14, 15, 16, 19}
.. which you can't get using a simple group/count.
-D
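The in/out pair above is a running (prefix) sum. As a small worked check in Python, using the standard library's `itertools.accumulate`:

```python
from itertools import accumulate

values = [2, 5, 7, 1, 1, 3]

# accumulate() yields the running sum: each output is the sum of all
# inputs up to and including that position.
cumulative = list(accumulate(values))

print(cumulative)  # [2, 7, 14, 15, 16, 19]
```

Each output depends on every earlier input, which is why a per-group COUNT or SUM cannot produce it.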
-
Re: Cumulative totals in an ORDERed relation.
Kris Coward 2010-12-19, 22:42
Right, that's a good point, it is a non-parallelizable process. I probably should just dump it through a script, since even an entire century of data would be <1M hours and not really need to take advantage of the cluster. ISTR there's some pretty good functionality for that, so I just need to look it up in the documentation again.

Thanks,
Kris
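For completeness, Dmitriy's aside that the computation "can be parallelized, but it's not pretty and involves multiple passes" can be sketched as a chunked prefix sum. This is one common way to do it, not necessarily what he had in mind; the function name and the in-memory lists are assumptions, and the two loops marked as parallelizable run serially here.

```python
from itertools import accumulate
from typing import List


def chunked_running_totals(values: List[int], chunk_size: int) -> List[int]:
    """Running totals computed chunk by chunk (chunk_size must be > 0)."""
    # Split the ordered input into contiguous chunks.
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]

    # Pass 1 (parallelizable): total of each chunk, independent of the others.
    chunk_sums = [sum(chunk) for chunk in chunks]

    # Small serial step: each chunk's offset is the sum of all earlier chunks.
    offsets = [0] + list(accumulate(chunk_sums))[:-1]

    # Pass 2 (parallelizable): local running totals shifted by the offset.
    result: List[int] = []
    for chunk, offset in zip(chunks, offsets):
        result.extend(offset + t for t in accumulate(chunk))
    return result
```

For fewer than a million hourly rows, as estimated above, the single-pass script is simpler; the chunked version only pays off when the input is too large for one process.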
-
Re: Cumulative totals in an ORDERed relation.
Kris Coward 2010-12-19, 22:49
Well for the step you're describing (which I need to do as a preliminary step to accumulating the hours), I just do something in the vein of

    NewRel = GROUP OldRel BY timestamp/3600;
    HourlyRel = FOREACH NewRel GENERATE group as hour, OldRel.something AS something,...;

(Noting that timestamp is stored as a long, so I get integer division and the GROUP does what's wanted)

Dmitriy was right both about what I was trying to do, and that it's an inherently serial operation.

Thanks,
Kris
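The integer-division bucketing described above can be illustrated in Python as follows. This is a sketch under assumptions: rows are `(timestamp, amount)` pairs with non-negative Unix-second timestamps, the per-hour aggregate is a sum, and `hourly_totals` is a hypothetical name; the thread does not say what is done with `OldRel.something`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hourly_totals(rows: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Sum amounts per hour bucket, where bucket = timestamp // 3600."""
    totals: Dict[int, int] = defaultdict(int)
    for timestamp, amount in rows:
        # Integer division maps every second of an hour to the same key.
        # For non-negative timestamps, Python's // matches truncating division.
        totals[timestamp // 3600] += amount
    return dict(totals)
```

Feeding the resulting per-hour totals, sorted by hour, into a running sum then gives the cumulative series asked about at the start of the thread.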