Re: Cumulative totals in an ORDERed relation.

Well, for the step you're describing (which I need to do as a preliminary
step to accumulating the hours), I just do something along the lines of

NewRel = GROUP OldRel BY timestamp / 3600;
HourlyRel = FOREACH NewRel GENERATE group AS hour, OldRel.something AS something, ...;

(Noting that timestamp is stored as a long, so I get integer division
and the GROUP does what's wanted)
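
For reference, a self-contained sketch of that step, assuming the input is
(timestamp, collected) pairs; the path, field names, and the SUM aggregate
are illustrative, not from the original mail:

-- hypothetical input: one record per event, with a Unix timestamp
OldRel = LOAD 'events' AS (timestamp:long, collected:int);
-- integer division on the long timestamp buckets each record into its
-- hour since the epoch
NewRel = GROUP OldRel BY timestamp / 3600;
-- one row per hour, carrying the per-hour total
HourlyTotals = FOREACH NewRel GENERATE group AS hour, SUM(OldRel.collected) AS collected;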

Dmitriy was right both about what I was trying to do, and that it's an
inherently serial operation.

Thanks,
Kris

On Fri, Dec 17, 2010 at 06:32:38PM -0500, Zach Bailey wrote:
>
>  I believe what you're trying to do is this: you have some sort of data, and a timestamp.
>
>
> What you want to figure out is how many times each possible value of "data" appears in a certain time period (say, hourly).
>
>
> Let's say data can have three possible string values: {'a', 'b', 'c'}
>
>
> Your timestamp, for convenience's sake, is a Unix UTC timestamp or an ISO-formatted date (I would strongly recommend using one of these, since there are already piggybank functions to slice and dice them).
>
>
> To accumulate all the times that the data 'a' appeared in an hour you would do something like this:
>
>
> --register piggybank.jar for iso date functions
> REGISTER ./piggybank.jar
> allData = LOAD ... AS (string:chararray, ts:long);
> -- convert ts to an ISO date, and truncate to the hour
> allDataISODates = FOREACH allData GENERATE string, org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts)) AS isoHour;
> -- group by string and hour
> groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> -- append counts
> stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string AS string, group.isoHour AS isoHour, COUNT(allDataISODates.string) AS count;
>
>
> You will now have a relation that looks like:
> {'a', '2010-12-13T12:00:00', 2334}
> {'b', '2010-12-13T12:00:00', 123}
> {'c', '2010-12-13T12:00:00', 3}
> {'a', '2010-12-13T13:00:00', 34231}
> {'b', '2010-12-13T13:00:00', 34}
> {'c', '2010-12-13T13:00:00', 134}
>
>
> Is that the sort of thing you're looking to do?
>
> -Zach
>
>
> On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
>
> > What you are suggesting seems to be a fundamentally single-threaded process
> > (well, it can be parallelized, but it's not pretty and involves multiple
> > passes), so it's not a good fit for the map-reduce paradigm (how would you
> > do cumulative totals for 25 billion entries?). Pig tends to avoid
> > implementing methods that restrict the scalability of computations in this
> > way. Your idea of streaming through a script would work; you could also
> > write an accumulating UDF and use it on the result of doing a GROUP ALL on
> > your relation.
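
[As a minimal sketch of the two approaches described above; the UDF name
myudfs.RunningTotal and the script running_total.py are hypothetical
placeholders, not built-in or piggybank functions:

-- approach 1: GROUP ALL collapses the relation into a single bag, which a
-- nested ORDER sorts before handing it to an accumulating UDF that is
-- assumed to emit one (hour, running total) tuple per input tuple
REGISTER ./myudfs.jar;
allRows = GROUP hourlyTotals ALL;
cumulativeTotals = FOREACH allRows {
    ordered = ORDER hourlyTotals BY hour;
    GENERATE FLATTEN(myudfs.RunningTotal(ordered)) AS (hour:int, collected:long);
};

-- approach 2: force a total order onto a single reducer, then stream the
-- sorted rows through an external script that keeps a running sum
ordered = ORDER hourlyTotals BY hour PARALLEL 1;
cumulativeTotals = STREAM ordered THROUGH `python running_total.py`
                   AS (hour:int, collected:long);

Either way the data funnels through one task, which is the serial
bottleneck being pointed out.]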
> >
> > -Dmitriy
> >
> > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[EMAIL PROTECTED]> wrote:
> >
> >
> > >  Hello,
> > >
> > >  Is there some sort of mechanism by which I could cause a value to
> > >  accumulate within a relation? What I'd like to do is something along the
> > >  lines of having a long called accumulator, and an outer bag called
> > >  hourlyTotals with a schema of (hour:int, collected:int)
> > >
> > >  accumulator = 0L; -- I know this line doesn't work
> > >  hourlyTotals = ORDER hourlyTotals BY collected;
> > >  cumulativeTotals = FOREACH hourlyTotals {
> > >      accumulator += collected;
> > >      GENERATE hour, accumulator AS collected;
> > >  }
> > >
> > >  Could something like this be made to work? Is there something similar that
> > >  I can do instead? Do I just need to pipe the relation through an
> > >  external script to get what I want?
> > >
> > >  Thanks,
> > >  Kris
> > >
> > >  --
> > >  Kris Coward http://unripe.melon.org/
> > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3