Re: when Algebraic UDF are used ?
pablomar 2012-07-25, 17:25
side note: sorry if it sounded bad. It was not an RTFM response. I've just sent
you the best explanation I could, and that book explains it better than I can.

On Wed, Jul 25, 2012 at 1:21 PM, pablomar <[EMAIL PROTECTED]> wrote:

> from the same book (
> http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html)
>
> "Memory Issues in Eval Funcs
>
> Some operations you will do in your UDFs will require more memory than is
> available. As an example you may want to build a UDF that calculates the
> cumulative sum of a set of inputs. This will return a bag of values since
> for each input it needs to return the intermediate sum at that input.
>
> Pig's bags handle spilling data to disk automatically when they pass a
> certain size threshold, or when only a certain amount of heap space
> remains. Spilling to disk is expensive, and whenever possible should be
> avoided. But if you must store large amounts of data in a bag, Pig will
> manage it.
>
> Bags are the only Pig datatype that know how to spill. Tuples and maps must
> fit into memory. Bags that are too large to fit in memory can still be
> referenced in a tuple or a map. This will not be counted as those tuples or
> maps not fitting into memory"
>
> On Wed, Jul 25, 2012 at 1:07 PM, Benoit Mathieu <[EMAIL PROTECTED]> wrote:
>
>> Thanks for your answers.
>>
>> So, I have further questions.
>> Sorting the bag myself in my UDF would solve my problem, but I don't know
>> what happens with bags that do not fit in memory.
>> How does Pig manage large bags? How are they passed to UDFs?
>>
>> ++
>> benoit
>>
>>
>> 2012/7/25 Alan Gates <[EMAIL PROTECTED]>
>>
>> > It can't use the algebraic interface in this case because the data has to
>> > be sorted (which means it has to see all the data) before passing it to
>> > your UDF. If you remove the ORDER statement then the algebraic portion of
>> > your UDF will be invoked.
>> >
>> > Alan.
>> >
>> > On Jul 25, 2012, at 9:32 AM, Benoit Mathieu wrote:
>> >
>> > > Hi pig users,
>> > >
>> > > I have coded my own algebraic UDF in Java, and it seems that Pig does not
>> > > use the algebraic interface at all. (I put some log messages in my
>> > > Initial, Intermed and Final functions, and they are never logged.)
>> > > Pig uses only the main "exec" function.
>> > >
>> > > My UDF needs to get the bag sorted.
>> > > Here is my pig script:
>> > >
>> > > A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
>> > > B = GROUP A BY k1;
>> > > C = FOREACH B {
>> > >     tmp = ORDER A.(k2,value) BY k2;
>> > >     GENERATE group, MyUDF(tmp);
>> > > }
>> > > ...
>> > >
>> > > Does anyone know why Pig does not use the algebraic interface?
>> > >
>> > > thanks,
>> > >
>> > > Benoit
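
For readers following the thread, Alan's point can be illustrated without any
Pig dependency. The sketch below is plain Java, not the Pig API: the class and
method names (AlgebraicSumSketch, initial, intermed, finalStage, sumInChunks)
are made up for illustration. It only assumes what the thread states, namely
that an algebraic UDF is split into Initial, Intermed and Final functions. A
sum can be rebuilt from partial results over arbitrary chunks of the bag, which
is what makes it algebraic. A function that needs the whole bag in sorted order
cannot be decomposed this way, which is consistent with Pig calling only exec
when the ORDER statement is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java illustration (no Pig classes) of the three-stage shape of an
 * algebraic function. All names here are hypothetical.
 */
public class AlgebraicSumSketch {

    // Initial stage: turns one raw input value into a partial result.
    public static long initial(int value) {
        return value;
    }

    // Intermediate stage: merges any number of partial results into one.
    // It must give the same answer however the partials were grouped.
    public static long intermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merges the remaining partial results into the answer.
    public static long finalStage(List<Long> partials) {
        return intermed(partials);
    }

    // Simulates processing a bag in independent chunks of chunkSize (> 0)
    // elements, then merging the per-chunk results in the final stage.
    public static long sumInChunks(List<Integer> bag, int chunkSize) {
        List<Long> combined = new ArrayList<>();
        for (int start = 0; start < bag.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, bag.size());
            List<Long> partials = new ArrayList<>();
            for (int v : bag.subList(start, end)) {
                partials.add(initial(v));
            }
            combined.add(intermed(partials));
        }
        return finalStage(combined);
    }

    public static void main(String[] args) {
        List<Integer> bag = Arrays.asList(5, 1, 4, 2, 3);
        System.out.println(sumInChunks(bag, 2)); // three chunks
        System.out.println(sumInChunks(bag, 5)); // a single chunk
    }
}
```

Both calls print 15: the result does not depend on how the bag is chunked or in
which order the chunks arrive. That independence is exactly what a sorted-bag
computation such as a cumulative sum lacks.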