Pig >> mail # user >> when Algebraic UDF are used ?


Benoit Mathieu 2012-07-25, 16:32
pablomar 2012-07-25, 16:41
Alan Gates 2012-07-25, 16:40
Benoit Mathieu 2012-07-25, 17:07
pablomar 2012-07-25, 17:21
Benoit Mathieu 2012-07-25, 17:32

Re: when Algebraic UDF are used ?
Side note: sorry if that sounded bad. It was not meant as an RTFM response.
I just sent you the best explanation I could find, and that book explains it
better than I can.
On Wed, Jul 25, 2012 at 1:21 PM, pablomar <[EMAIL PROTECTED]> wrote:

> from the same book (
> http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html)
>
> "Memory Issues in Eval Funcs
>
> Some operations you will do in your UDFs will require more memory than is
> available. As an example you may want to build a UDF that calculates the
> cumulative sum of a set of inputs. This will return a bag of values since
> for each input it needs to return the intermediate sum at that input.
>
> Pig's bags handle spilling data to disk automatically when they pass a
> certain size threshold, or when only a certain amount of heap space
> remains. Spilling to disk is expensive, and whenever possible should be
> avoided. But if you must store large amounts of data in a bag, Pig will
> manage it.
>
> Bags are the only Pig data type that knows how to spill. Tuples and maps
> must fit into memory. Bags that are too large to fit in memory can still be
> referenced in a tuple or a map. This will not be counted as those tuples or
> maps not fitting into memory"
>
>
>
>
> On Wed, Jul 25, 2012 at 1:07 PM, Benoit Mathieu <[EMAIL PROTECTED]> wrote:
>
>> Thanks for your answers.
>>
>> So, I have further questions.
>> Sorting the bag myself in my UDF would solve my problem, but I don't know
>> what happens with bags that do not fit in memory.
>> How does Pig manage large bags? How are they passed to a UDF?
>>
>> ++
>> benoit
>>
>>
>> 2012/7/25 Alan Gates <[EMAIL PROTECTED]>
>>
>> > It can't use the algebraic interface in this case because the data has
>> > to be sorted (which means it has to see all the data) before passing it
>> > to your UDF.  If you remove the ORDER statement then the algebraic
>> > portion of your UDF will be invoked.
>> >
>> > Alan.
>> >
>> > On Jul 25, 2012, at 9:32 AM, Benoit Mathieu wrote:
>> >
>> > > Hi pig users,
>> > >
>> > > I have coded my own algebraic UDF in Java, and it seems that Pig does
>> > > not use the algebraic interface at all. (I put some log messages in my
>> > > Initial, Intermed and Final functions, and they are never logged.)
>> > > Pig uses only the main "exec" function.
>> > >
>> > > My UDF needs to get the bag sorted.
>> > > Here is my pig script:
>> > >
>> > > A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
>> > > B = GROUP A BY k1;
>> > > C = FOREACH B {
>> > >  tmp = ORDER A.(k2,value) BY k2;
>> > >  GENERATE group, MyUDF(tmp);
>> > > }
>> > > ...
>> > >
>> > >
>> > > Does anyone know why Pig does not use the algebraic interface?
>> > >
>> > > thanks,
>> > >
>> > > Benoit
>> >
>> >
>>
>
>
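
To illustrate Alan's point, here is a minimal sketch in plain Java of how an algebraic computation is split into the Initial, Intermed and Final steps Benoit mentions. It does not use Pig's classes: the class and method names (AlgebraicSketch, initial, intermed, finalStep, exec) are made up for illustration, and SUM stands in for a real UDF. Each partial step only sees a chunk of the group's values, in no particular order, which is why a UDF that needs the whole bag sorted first cannot be evaluated this way.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of an algebraic aggregate. It does NOT use Pig's classes;
// the method names only mirror the Initial/Intermed/Final roles from the thread.
public class AlgebraicSketch {

    // Initial: runs on one chunk of a group's values, in whatever order they arrive.
    static long initial(List<Integer> chunk) {
        long partial = 0;
        for (int v : chunk) {
            partial += v;
        }
        return partial;
    }

    // Intermed: merges partial results; may be applied zero or more times.
    static long intermed(List<Long> partials) {
        long merged = 0;
        for (long p : partials) {
            merged += p;
        }
        return merged;
    }

    // Final: merges the remaining partials into the answer for the group.
    static long finalStep(List<Long> partials) {
        return intermed(partials);
    }

    // exec: the non-algebraic path, which sees the whole bag at once.
    static long exec(List<Integer> wholeBag) {
        return initial(wholeBag);
    }

    public static void main(String[] args) {
        List<Integer> bag = Arrays.asList(5, 1, 4, 2, 3);
        // Split the bag into two chunks, as if two map tasks had each seen part of it.
        long p1 = initial(bag.subList(0, 2));
        long p2 = initial(bag.subList(2, 5));
        long viaAlgebraic = finalStep(Arrays.asList(intermed(Arrays.asList(p1)), p2));
        // Both paths agree for an order-independent function such as SUM.
        System.out.println(viaAlgebraic == exec(bag));
    }
}
```

The chunked path and exec agree here only because SUM does not care about order or about where the chunks are cut. A function over a sorted bag has no such property.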
Benoit Mathieu 2012-07-25, 16:40
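
Benoit's fallback (sorting the bag inside the UDF) combined with the book's cumulative-sum example could look roughly like the sketch below. It is plain Java, not Pig's API: SortInsideUdfSketch and cumulativeSumByK2 are hypothetical names, and an int[] {k2, value} pair stands in for a Pig tuple. Note the assumption that the whole bag is copied into an in-memory list, so this version gets none of the spilling behaviour the book describes for Pig's bags.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of a UDF that sorts its input itself. It does NOT use Pig's
// bag or tuple classes; int[] {k2, value} is a hypothetical stand-in for a tuple.
public class SortInsideUdfSketch {

    // Returns the running sum of `value` after ordering the rows by k2.
    static List<Long> cumulativeSumByK2(List<int[]> bag) {
        // Copying into a list holds every row in memory at once.
        List<int[]> sorted = new ArrayList<>(bag);
        sorted.sort(Comparator.comparingInt((int[] row) -> row[0]));

        List<Long> running = new ArrayList<>();
        long sum = 0;
        for (int[] row : sorted) {
            sum += row[1];
            running.add(sum); // one intermediate sum per input row
        }
        return running;
    }

    public static void main(String[] args) {
        List<int[]> bag = new ArrayList<>();
        bag.add(new int[] {3, 10});
        bag.add(new int[] {1, 5});
        bag.add(new int[] {2, 7});
        System.out.println(cumulativeSumByK2(bag)); // prints [5, 12, 22]
    }
}
```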