Pig, mail # user - Cumulative totals in an ORDERed relation.


Re: Cumulative totals in an ORDERed relation.
Kris Coward 2010-12-19, 22:49

Well for the step you're describing (which I need to do as a preliminary
step to accumulating the hours), I just do something in the vein of

NewRel = GROUP OldRel BY timestamp/3600;
HourlyRel = FOREACH NewRel GENERATE group as hour, OldRel.something AS something,...;

(Noting that timestamp is stored as a long, so I get integer division
and the GROUP does what's wanted)
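
For anyone following along outside Pig, here is the same bucketing idea as a
minimal Python sketch. The name hour_bucket and the sample timestamps are made
up for illustration, and it assumes timestamps are whole, non-negative seconds
since the epoch (Python's // floors, which matches truncating integer division
only for non-negative values).

```python
def hour_bucket(timestamp):
    """Return the hour index for a Unix timestamp given in whole seconds."""
    # Integer division drops the seconds within the hour, so every
    # timestamp in the same hour maps to the same bucket number.
    return timestamp // 3600


# Hypothetical sample data: three timestamps in one hour, one in the next.
samples = [1292626800, 1292626801, 1292630399, 1292630400]
buckets = [hour_bucket(ts) for ts in samples]
```

Rows that share a bucket value are the ones the GROUP above would collect
together.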

Dmitriy was right both about what I was trying to do, and that it's an
inherently serial operation.
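
A rough sketch of the serial step as an external script, in Python. The
function names and the tab-separated "hour<TAB>collected" layout are
assumptions for illustration, not anything Pig prescribes. It only gives
correct totals if every row passes through a single process, already in the
desired order.

```python
def parse(lines):
    """Parse tab-separated 'hour<TAB>collected' lines, skipping blank ones."""
    for line in lines:
        line = line.strip()
        if line:
            hour, collected = line.split("\t")
            yield int(hour), int(collected)


def cumulative_totals(rows):
    """Yield (hour, running_total) for (hour, collected) pairs, in input order."""
    accumulator = 0
    for hour, collected in rows:
        # The running total depends on every earlier row, which is why
        # this step is serial.
        accumulator += collected
        yield hour, accumulator


# Hypothetical input, already ordered by hour.
demo = ["0\t5\n", "1\t7\n", "2\t3\n"]
result = list(cumulative_totals(parse(demo)))
```

To use it as a streaming script, one would feed sys.stdin to parse and write
each (hour, total) pair back out as a tab-separated line.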

Thanks,
Kris

On Fri, Dec 17, 2010 at 06:32:38PM -0500, Zach Bailey wrote:
>
>  I believe what you're trying to do is this. You have some sort of data, and a timestamp:
>
>
> What you want to figure out is how many times each possible value of "data" appears in a certain time period (say, hourly).
>
>
> Let's say data can have three possible string values: {'a', 'b', 'c'}
>
>
> Your timestamp, for convenience's sake, is a Unix UTC timestamp or ISO formatted date (I would strongly recommend using one of these since there are already piggybank functions to slice and dice them).
>
>
> To accumulate all the times that the data 'a' appeared in an hour you would do something like this:
>
>
> --register piggybank.jar for iso date functions
> REGISTER ./piggybank.jar;
> allData = load ... as (string:chararray, ts:long);
> --convert ts to ISO Date, and truncate to the hour
> allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) as isoHour;
> -- group by hour and string
> groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> -- append counts
> stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;
>
>
> You will now have a relation that looks like:
> {'a', '2010-12-13T12:00:00', 2334}
> {'b', '2010-12-13T12:00:00', 123}
> {'c', '2010-12-13T12:00:00', 3}
> {'a', '2010-12-13T13:00:00', 34231}
> {'b', '2010-12-13T13:00:00', 34}
> {'c', '2010-12-13T13:00:00', 134}
>
>
> Is that the sort of thing you're looking to do?
>
> -Zach
>
>
> On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
>
> > What you are suggesting seems to be a fundamentally single-threaded process
> > (well, it can be parallelized, but it's not pretty and involves multiple
> > passes), so it's not a good fit for the map-reduce paradigm (how would you
> > do accumulative totals for 25 billion entries?). Pig tends to avoid
> > implementing methods that restrict scaling computations in this way. Your
> > idea of streaming through a script would work; you could also write an
> > accumulative UDF and use it on the result of doing a GROUP ALL on your
> > relation.
> >
> > -Dmitriy
> >
> > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[EMAIL PROTECTED]> wrote:
> >
> >
> > >  Hello,
> > >
> > >  Is there some sort of mechanism by which I could cause a value to
> > >  accumulate within a relation? What I'd like to do is something along the
> > >  lines of having a long called accumulator, and an outer bag called
> > >  hourlyTotals with a schema of (hour:int, collected:int)
> > >
> > >  accumulator = 0L; -- I know this line doesn't work
> > >  ORDER hourlyTotals BY collected;
> > >  cumulativeTotals = FOREACH hourlyTotals {
> > >  accumulator += collected;
> > >  GENERATE day, accumulator AS collected;
> > >  }
> > >
> > >  Could something like this be made to work? Is there something similar that
> > >  I can do instead? Do I just need to pipe the relation through an
> > >  external script to get what I want?
> > >
> > >  Thanks,
> > >  Kris
> > >
> > >  --
> > >  Kris Coward http://unripe.melon.org/
> > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3