Pig, mail # user - when Algebraic UDF are used ?


Benoit Mathieu 2012-07-25, 16:32
pablomar 2012-07-25, 16:41
Alan Gates 2012-07-25, 16:40
Benoit Mathieu 2012-07-25, 17:07
pablomar 2012-07-25, 17:21
Benoit Mathieu 2012-07-25, 17:32
Re: when Algebraic UDF are used ?
pablomar 2012-07-25, 17:25
Side note: sorry if that sounded bad. It was not an RTFM response. I just sent
you the best explanation I could, and that book explains it better than I
can.
On Wed, Jul 25, 2012 at 1:21 PM, pablomar <[EMAIL PROTECTED]> wrote:

> from the same book (
> http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html)
>
> "Memory Issues in Eval Funcs
>
> Some operations you will do in your UDFs will require more memory than is
> available. As an example you may want to build a UDF that calculates the
> cumulative sum of a set of inputs. This will return a bag of values since
> for each input it needs to return the intermediate sum at that input.
>
> Pig's bags handle spilling data to disk automatically when they pass a
> certain size threshold, or when only a certain amount of heap space
> remains. Spilling to disk is expensive, and whenever possible should be
> avoided. But if you must store large amounts of data in a bag, Pig will
> manage it.
>
> Bags are the only Pig datatype that know how to spill. Tuples and maps must
> fit into memory. Bags that are too large to fit in memory can still be
> referenced in a tuple or a map. This will not be counted as those tuples or
> maps not fitting into memory"
>
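The cumulative-sum UDF mentioned in the quoted passage can be sketched as follows. This is a minimal illustration in plain Python, not code against the Pig API: the function name `cumulative_sum` and the `(k2, value)` tuple layout are assumptions borrowed from the script later in this thread. The point it shows is that the input can be read one tuple at a time, so only the running total has to stay in memory on the input side, while the output is as large as the input and is what a bag would have to hold.

```python
from typing import Iterable, Iterator, Tuple


def cumulative_sum(sorted_bag: Iterable[Tuple[int, int]]) -> Iterator[Tuple[int, int]]:
    """Yield (k2, running_total) for each (k2, value) tuple, in input order.

    Hypothetical helper: the caller is assumed to pass tuples already
    sorted by k2, as the ORDER statement in the script would do.
    """
    total = 0
    for k2, value in sorted_bag:
        total += value          # only the running total is kept between tuples
        yield (k2, total)       # one output tuple per input tuple


# Sample data (made up for illustration), already sorted by k2.
rows = [(1, 20), (2, 30), (3, 10), (5, 40)]
result = list(cumulative_sum(rows))
```

Because each output depends on every tuple that came before it in sort order, this kind of function needs the whole sorted bag and cannot be computed from independent partial results.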
>
>
>
> On Wed, Jul 25, 2012 at 1:07 PM, Benoit Mathieu <[EMAIL PROTECTED]> wrote:
>
>> Thanks for your answers.
>>
>> So, I have further questions.
>> Sorting the bag myself in my UDF would solve my problem, but I don't know
>> what happens with bags that do not fit in memory.
>> How does Pig manage large bags? How are they passed to a UDF?
>>
>> ++
>> benoit
>>
>>
>> 2012/7/25 Alan Gates <[EMAIL PROTECTED]>
>>
>> > It can't use the algebraic interface in this case because the data has to
>> > be sorted (which means it has to see all the data) before passing it to
>> > your UDF.  If you remove the ORDER statement then the algebraic portion of
>> > your UDF will be invoked.
>> >
>> > Alan.
>> >
>> > On Jul 25, 2012, at 9:32 AM, Benoit Mathieu wrote:
>> >
>> > > Hi pig users,
>> > >
>> > > I have coded my own algebraic UDF in Java, and it seems that Pig does not
>> > > use the algebraic interface at all. (I put some log messages in my
>> > > Initial, Intermed and Final functions, and they are never logged.)
>> > > Pig uses only the main "exec" function.
>> > >
>> > > My UDF needs to get the bag sorted.
>> > > Here is my pig script:
>> > >
>> > > A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
>> > > B = GROUP A BY k1;
>> > > C = FOREACH B {
>> > >  tmp = ORDER A.(k2,value) BY k2;
>> > >  GENERATE group, MyUDF(tmp);
>> > > }
>> > > ...
>> > >
>> > >
>> > > Does anyone know why Pig does not use the algebraic interface?
>> > >
>> > > thanks,
>> > >
>> > > Benoit
>> >
>> >
>>
>
>
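To make the Initial/Intermed/Final idea from this thread concrete, here is a rough sketch in plain Python rather than against the Pig Java API. The function names `initial`, `intermed`, `final`, `exec_whole` and `run_algebraic` are hypothetical and only mirror the roles discussed above, and the real interface passes tuples and bags rather than bare integers. It shows that a function such as a sum can be computed from partial results over arbitrary chunks of the bag, which is the property the algebraic path relies on.

```python
def exec_whole(bag):
    """Non-algebraic path: sees the entire bag at once."""
    return sum(value for _k2, value in bag)


def initial(tup):
    """First phase: turn one input tuple into a partial result."""
    _k2, value = tup
    return value


def intermed(partials):
    """Middle phase: merge some partial results into one partial result."""
    return sum(partials)


def final(partials):
    """Last phase: merge the remaining partial results into the answer."""
    return sum(partials)


def run_algebraic(bag, chunk_size):
    """Simulate partial aggregation over arbitrary chunks of the bag."""
    partials = [initial(t) for t in bag]
    combined = [
        intermed(partials[i:i + chunk_size])
        for i in range(0, len(partials), chunk_size)
    ]
    return final(combined)


# Sample (k2, value) tuples, made up for illustration and deliberately unsorted.
bag = [(3, 10), (1, 20), (2, 30), (5, 40)]
```

Here `run_algebraic(bag, 2)` and `exec_whole(bag)` agree no matter how the bag is chunked or ordered. A function that needs its input sorted, as in the script above, has no such decomposition: no chunk can be processed until all the data has been seen and ordered, which matches Alan's explanation of why only `exec` is called.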
Benoit Mathieu 2012-07-25, 16:40