|
Benoit Mathieu
2012-07-25, 16:32
Alan Gates
2012-07-25, 16:40
Benoit Mathieu
2012-07-25, 16:40
pablomar
2012-07-25, 16:41
Benoit Mathieu
2012-07-25, 17:07
pablomar
2012-07-25, 17:21
pablomar
2012-07-25, 17:25
Benoit Mathieu
2012-07-25, 17:32
|
-
when Algebraic UDF are used ? Benoit Mathieu 2012-07-25, 16:32
Hi Pig users,

I have written my own algebraic UDF in Java, and it seems that Pig does not use the algebraic interface at all. (I put log messages in my Initial, Intermed and Final functions, and they are never logged.) Pig uses only the main "exec" function.

My UDF needs the bag to be sorted. Here is my Pig script:

    A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
    B = GROUP A BY k1;
    C = FOREACH B {
        tmp = ORDER A.(k2,value) BY k2;
        GENERATE group, MyUDF(tmp);
    }
    ...

Does anyone know why Pig does not use the algebraic interface?

Thanks,
Benoit
-
Re: when Algebraic UDF are used ? Alan Gates 2012-07-25, 16:40
It can't use the algebraic interface in this case because the data has to be sorted (which means it has to see all the data) before passing it to your UDF. If you remove the ORDER statement then the algebraic portion of your UDF will be invoked.
Alan.
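To make the point above concrete, here is a minimal plain-Java sketch of why a combinable function can be computed from partial results while an order-dependent one needs the whole bag. It uses no Pig classes at all; the class and method names are hypothetical, and "largest gap between sorted neighbours" is only an assumed stand-in for whatever MyUDF actually computes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only: no Pig classes are used, and every name here is hypothetical.
final class AlgebraicSketch {

    // Whole-bag computation, the analogue of the UDF's exec method.
    static long sumAll(List<Integer> bag) {
        long total = 0;
        for (int v : bag) {
            total += v;
        }
        return total;
    }

    // Combiner-style computation: each chunk is reduced to a partial sum
    // (Initial/Intermed analogue), then the partials are merged (Final analogue).
    static long sumInChunks(List<Integer> bag, int chunkSize) {
        List<Long> partials = new ArrayList<>();
        for (int start = 0; start < bag.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, bag.size());
            partials.add(sumAll(bag.subList(start, end)));
        }
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Order-dependent computation: largest gap between neighbours of a sorted bag.
    static int maxGap(List<Integer> sortedBag) {
        int best = 0;
        for (int i = 1; i < sortedBag.size(); i++) {
            best = Math.max(best, sortedBag.get(i) - sortedBag.get(i - 1));
        }
        return best;
    }

    // Naive chunked version: gaps that straddle a chunk boundary are never seen.
    static int maxGapInChunks(List<Integer> sortedBag, int chunkSize) {
        int best = 0;
        for (int start = 0; start < sortedBag.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, sortedBag.size());
            best = Math.max(best, maxGap(sortedBag.subList(start, end)));
        }
        return best;
    }

    public static void main(String[] args) {
        List<Integer> bag = Arrays.asList(1, 2, 10, 11);
        System.out.println(sumAll(bag) + " " + sumInChunks(bag, 2));    // prints "24 24"
        System.out.println(maxGap(bag) + " " + maxGapInChunks(bag, 2)); // prints "8 1"
    }
}
```

The sum is the same whether it is computed in one pass or from per-chunk partials, which is the property a combiner relies on. The gap computation is not: the chunked version misses the gap between 2 and 10, so a function like this has to see the complete sorted bag.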
-
Re: when Algebraic UDF are used ? Benoit Mathieu 2012-07-25, 16:40
I'm using Pig 0.9.2 from the CDH4 packaging.
++ benoit
-
Re: when Algebraic UDF are used ? pablomar 2012-07-25, 16:41
According to http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html:
"Implementing Algebraic does not guarantee that the algebraic implementation will always be used. Pig only chooses the algebraic implementation if all UDFs in the same foreach statement are algebraic. This is because our testing has shown that using the combiner with data that cannot be combined significantly slows down the job. And there is no way in Hadoop to route some data to the combiner (for algebraic functions) and some straight to the reducer (for non-algebraic). This means that your UDF must always implement the exec method, even if you hope it will always be used in the algebraic mode. It is also an additional motivation to implement algebraic for your UDFs when possible."
-
Re: when Algebraic UDF are used ? Benoit Mathieu 2012-07-25, 17:07
Thanks for your answers.
So, I have further questions. Sorting the bag myself in my UDF would solve my problem, but I don't know what happens with bags that do not fit in memory. How does Pig manage large bags? How are they passed to a UDF?
++ benoit
-
Re: when Algebraic UDF are used ? pablomar 2012-07-25, 17:21
From the same book (http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html):

"Memory Issues in Eval Funcs

Some operations you will do in your UDFs will require more memory than is available. As an example you may want to build a UDF that calculates the cumulative sum of a set of inputs. This will return a bag of values since for each input it needs to return the intermediate sum at that input.

Pig's bags handle spilling data to disk automatically when they pass a certain size threshold, or when only a certain amount of heap space remains. Spilling to disk is expensive, and whenever possible should be avoided. But if you must store large amounts of data in a bag, Pig will manage it.

Bags are the only Pig datatype that know how to spill. Tuple and maps must fit into memory. Bags that are too large to fit in memory can still be referenced in a tuple or a map. This will not be counted as those tuples or maps not fitting into memory"
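The cumulative-sum example from the quoted passage can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses no Pig classes, all names are hypothetical, and the sink parameter merely stands in for an output bag (in Pig, that bag is what would handle any spilling).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongConsumer;

// Plain-Java illustration only: no Pig classes are used, and every name here is hypothetical.
final class CumulativeSumSketch {

    // Reads the input once and pushes each running total to a sink.
    // Only the running total is held here; the sink decides how to store the output.
    static void cumulativeSum(Iterator<Integer> input, LongConsumer sink) {
        long running = 0;
        while (input.hasNext()) {
            running += input.next();
            sink.accept(running);
        }
    }

    public static void main(String[] args) {
        List<Long> out = new ArrayList<>();
        cumulativeSum(Arrays.asList(3, 1, 4).iterator(), out::add);
        System.out.println(out); // prints [3, 4, 8]
    }
}
```

The function itself keeps a single number in memory; the output grows with the input, which is why the passage describes the result as a bag that may have to spill.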
-
Re: when Algebraic UDF are used ? pablomar 2012-07-25, 17:25
Side note: sorry if that sounded bad. It was not meant as an RTFM response. I just sent you the best explanation I could find, and that book explains it better than I can.
-
Re: when Algebraic UDF are used ? Benoit Mathieu 2012-07-25, 17:32
Thanks!
++ benoit
|