Pig >> mail # user >> Cumulative totals in an ORDERed relation.


Re: Cumulative totals in an ORDERed relation.

Right, that's a good point: it is a non-parallelizable process. I
should probably just dump it through a script, since even an entire
century of data would be <1M hours and wouldn't really need to take
advantage of the cluster. I seem to recall there's some pretty good
functionality for that, so I just need to look it up in the
documentation again.

Thanks,
Kris
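
The "dump it through a script" idea above could look roughly like the following Python sketch. It is only an illustration under assumptions not stated in the thread: rows arrive as tab-separated `hour<TAB>collected` lines, they are already in the desired order, and a single process sees every row. The function name `cumulative_totals` is hypothetical.

```python
import sys


def cumulative_totals(lines):
    """Yield (hour, running_total) pairs from 'hour<TAB>collected' lines.

    Assumption: the lines are already sorted and all of them pass
    through this one process, so a single running total is valid.
    """
    running = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        hour, collected = line.split("\t")
        running += int(collected)
        yield hour, running


def write_totals(lines, out=sys.stdout):
    """Write each (hour, running_total) pair as a tab-separated line."""
    for hour, total in cumulative_totals(lines):
        out.write(f"{hour}\t{total}\n")


# Small in-memory demonstration; a streaming script would instead
# call write_totals(sys.stdin).
sample = ["0\t5\n", "1\t3\n", "2\t10\n"]
write_totals(sample)
```

A script like this would typically be attached with Pig's STREAM ... THROUGH operator on the ordered relation. The running total is only correct if one task processes the whole relation, so the job would presumably need to be limited to a single reducer.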

On Fri, Dec 17, 2010 at 03:22:53PM -0800, Dmitriy Ryaboy wrote:
> What you are suggesting seems to be a fundamentally single-threaded process
> (well, it can be parallelized, but it's not pretty and involves multiple
> passes), so it's not a good fit for the map-reduce paradigm (how would you
> do cumulative totals for 25 billion entries?).  Pig tends to avoid
> implementing methods that restrict scaling computations in this way. Your
> idea of streaming through a script would work; you could also write an
> accumulative UDF and use it on the result of doing a GROUP ALL on your
> relation.
>
> -Dmitriy
>
> On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > Is there some sort of mechanism by which I could cause a value to
> > accumulate within a relation? What I'd like to do is something along the
> > lines of having a long called accumulator, and an outer bag called
> > hourlyTotals with a schema of (hour:int, collected:int)
> >
> > accumulator = 0L; -- I know this line doesn't work
> > ORDER hourlyTotals BY collected;
> > cumulativeTotals = FOREACH hourlyTotals {
> >                        accumulator += collected;
> >                        GENERATE hour, accumulator AS collected;
> >                        }
> >
> > Could something like this be made to work? Is there something similar that
> > I can do instead? Do I just need to pipe the relation through an
> > external script to get what I want?
> >
> > Thanks,
> > Kris
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
> >
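
The remark in the thread that cumulative totals "can be parallelized, but it's not pretty and involves multiple passes" can be illustrated with a small Python sketch. This is one common multi-pass scheme, not necessarily the one the author had in mind: first total each partition independently, then turn those totals into a starting offset per partition, then compute local running sums shifted by that offset. The function names and the list-of-lists representation of partitions are assumptions made for illustration.

```python
from itertools import accumulate


def partition_totals(partitions):
    """Pass 1: sum each partition; every sum is independent of the others."""
    return [sum(p) for p in partitions]


def cumulative_by_partition(partitions):
    """Return per-partition cumulative totals over the concatenated data.

    Assumption: partitions are listed in global sort order, so the
    offset for partition i is the sum of all partitions before it.
    """
    totals = partition_totals(partitions)
    # Small sequential step: offsets from the per-partition totals.
    offsets = [0] + list(accumulate(totals))[:-1]
    # Pass 2: each partition needs only its own rows plus its offset.
    return [
        [offset + running for running in accumulate(p)]
        for p, offset in zip(partitions, offsets)
    ]


parts = [[1, 2], [3], [4, 5]]
print(cumulative_by_partition(parts))
```

Only the offset step is sequential, and it touches one number per partition rather than one per row. The cost is that the data has to be read twice.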