Pig >> mail # user >> when Algebraic UDF are used ?


Re: when Algebraic UDF are used ?
From the same book
(http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html):

"Memory Issues in Eval Funcs

Some operations you will do in your UDFs will require more memory than is
available. As an example you may want to build a UDF that calculates the
cumulative sum of a set of inputs. This will return a bag of values since
for each input it needs to return the intermediate sum at that input.

Pig's bags handle spilling data to disk automatically when they pass a
certain size threshold, or when only a certain amount of heap space
remains. Spilling to disk is expensive, and whenever possible should be
avoided. But if you must store large amounts of data in a bag, Pig will
manage it.

Bags are the only Pig datatype that know how to spill. Tuple and maps must
fit into memory. Bags that are too large to fit in memory can still be
referenced in a tuple or a map. This will not be counted as those tuples or
maps not fitting into memory"
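
To make the cumulative-sum example from the quote a bit more concrete, here is a minimal sketch in plain Java. It does not use any Pig classes; CumulativeSum and runningTotals are names made up for illustration. The assumption is that the input can be read one element at a time through an iterator, so the only state the computation itself needs is the running total. What grows with the input is the collected output, which in a real UDF would be the bag the book is talking about.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a UDF body; plain Java, no Pig classes involved.
final class CumulativeSum {

    // Wraps an input iterator and yields the running total at each element.
    // Only one long of state is kept, however many inputs there are.
    static Iterator<Long> runningTotals(Iterator<Long> inputs) {
        return new Iterator<Long>() {
            private long total = 0L;

            @Override
            public boolean hasNext() {
                return inputs.hasNext();
            }

            @Override
            public Long next() {
                total += inputs.next();
                return total;
            }
        };
    }

    public static void main(String[] args) {
        List<Long> values = List.of(3L, 1L, 4L, 1L, 5L);
        // Collecting the output is the part that grows with the input size.
        List<Long> out = new ArrayList<>();
        runningTotals(values.iterator()).forEachRemaining(out::add);
        System.out.println(out); // [3, 4, 8, 9, 14]
    }
}
```

Note that the result depends on the order of the inputs, which is why a function like this wants its bag sorted first.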

On Wed, Jul 25, 2012 at 1:07 PM, Benoit Mathieu <[EMAIL PROTECTED]> wrote:

> Thanks for your answers.
>
> So, I have further questions.
> Sorting the bag myself in my UDF would solve my problem, but I don't know
> what happens with bags that do not fit in memory.
> How does Pig manage large bags? How are they passed to a UDF?
>
> ++
> benoit
>
>
> 2012/7/25 Alan Gates <[EMAIL PROTECTED]>
>
> > It can't use the algebraic interface in this case because the data has to
> > be sorted (which means it has to see all the data) before passing it to
> > your UDF.  If you remove the ORDER statement then the algebraic portion of
> > your UDF will be invoked.
> >
> > Alan.
> >
> > On Jul 25, 2012, at 9:32 AM, Benoit Mathieu wrote:
> >
> > > Hi pig users,
> > >
> > > I have coded my own algebraic UDF in Java, and it seems that Pig does not
> > > use the algebraic interface at all. (I put some log messages in my
> > > Initial, Intermed and Final functions, and they're never logged.)
> > > Pig uses only the main "exec" function.
> > >
> > > My UDF needs to get the bag sorted.
> > > Here is my pig script:
> > >
> > > A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
> > > B = GROUP A BY k1;
> > > C = FOREACH B {
> > >  tmp = ORDER A.(k2,value) BY k2;
> > >  GENERATE group, MyUDF(tmp);
> > > }
> > > ...
> > >
> > >
> > > Does anyone know why pig does not use the algebraic interface ?
> > >
> > > thanks,
> > >
> > > Benoit
> >
> >
>
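
As a rough illustration of Alan's point above, here is a sketch in plain Java of the Initial / Intermed / Final split. It does not use Pig's Algebraic interface, and the class and method names are hypothetical. In Pig these phases correspond roughly to the map, combiner and reduce stages. A sum can be split this way because partial sums computed on separate chunks of a group can be merged afterwards; a function that needs the whole bag sorted before it starts has no such split, so only exec can be called.

```java
import java.util.List;

// Plain-Java illustration of the Initial / Intermed / Final split.
// This does not use Pig's Algebraic interface; all names are hypothetical.
final class AlgebraicSumSketch {

    // "Initial": turn one raw record into a partial result.
    static long initial(long value) {
        return value;
    }

    // "Intermed": merge partial results into a new partial result.
    static long intermed(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // "Final": merge the remaining partials into the answer.
    static long finalStep(List<Long> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        // Two chunks of the same group, as if seen by two different map tasks.
        long partialA = intermed(List.of(initial(5L), initial(7L)));
        long partialB = intermed(List.of(initial(1L), initial(2L), initial(3L)));
        System.out.println(finalStep(List.of(partialA, partialB))); // 18
    }
}
```

Splitting the same five values into different chunks gives the same final answer, which is the property the algebraic path relies on.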