|
|
-
Accumulator is not fired
Yen SYU 2012-03-13, 19:00
Hi all,
I just test a very simple pig script as following:
records = LOAD '$input' AS (hash:chararray, domain:chararray, host:chararray, page:chararray, freq:int); grpd = GROUP records BY (domain, host); stats = FOREACH grpd { hashes = records.hash; uniq_hashes = DISTINCT hashes; pages = records.page; GENERATE group.$1 AS host, group.$0 AS domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS page_count:long, SUM(freq) AS freq:long); }; STORE stats INTO '$output';
where PAGE_COUNT is a customized UDF implementing Accumulator. I add an EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the accumulate method is never called. Even I tried to remove all other built-in UDFs and keep the NESTED FOREACH as simple as:
stats = FOREACH grpd { pages = records.page; GENERATE group.$1 AS host, group.$0 AS domain, PAGE_COUNT(pages) AS page_count:long; };
Anyone idea what's going on behind the scenes?
Thanks, Yen
-
Re: Accumulator is not fired
Jonathan Coveney 2012-03-13, 19:06
What version of pig are you using?
just as an experiment in the simple case, can you try doing
GENERATE flatten(group) as (domain,host), ...(the rest)...
shouldn't make a difference, but I think I remember that in some older versions it did
2012/3/13 Yen SYU <[EMAIL PROTECTED]>
> Hi all, > > I just test a very simple pig script as following: > > records = LOAD '$input' AS (hash:chararray, domain:chararray, > host:chararray, page:chararray, freq:int); > grpd = GROUP records BY (domain, host); > stats = FOREACH grpd { > hashes = records.hash; > uniq_hashes = DISTINCT hashes; > pages = records.page; > GENERATE group.$1 AS host, group.$0 AS > domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS > page_count:long, SUM(freq) AS freq:long); > }; > STORE stats INTO '$output'; > > where PAGE_COUNT is a customized UDF implementing Accumulator. I add an > EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the > accumulate method is never called. Even I tried to remove all other > built-in UDFs and keep the NESTED FOREACH as simple as: > > stats = FOREACH grpd { > pages = records.page; > GENERATE group.$1 AS host, group.$0 AS > domain, PAGE_COUNT(pages) AS page_count:long; > }; > > Anyone idea what's going on behind the scenes? > > Thanks, > Yen >
-
Re: Accumulator is not fired
Yen SYU 2012-03-13, 19:22
Hi Jon,
Thanks for your reponse! I use pig 0.9.1-snapshot.
I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired. Also tried to remove generic type in accumulator but it did not help. :(
Is it easy for you to fire accumulator?
Yen
On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney <[EMAIL PROTECTED]>wrote:
> What version of pig are you using? > > just as an experiment in the simple case, can you try doing > > GENERATE flatten(group) as (domain,host), ...(the rest)... > > shouldn't make a difference, but I think I remember that in some older > versions it did > > 2012/3/13 Yen SYU <[EMAIL PROTECTED]> > > > Hi all, > > > > I just test a very simple pig script as following: > > > > records = LOAD '$input' AS (hash:chararray, domain:chararray, > > host:chararray, page:chararray, freq:int); > > grpd = GROUP records BY (domain, host); > > stats = FOREACH grpd { > > hashes = records.hash; > > uniq_hashes = DISTINCT hashes; > > pages = records.page; > > GENERATE group.$1 AS host, group.$0 AS > > domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS > > page_count:long, SUM(freq) AS freq:long); > > }; > > STORE stats INTO '$output'; > > > > where PAGE_COUNT is a customized UDF implementing Accumulator. I add an > > EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the > > accumulate method is never called. Even I tried to remove all other > > built-in UDFs and keep the NESTED FOREACH as simple as: > > > > stats = FOREACH grpd { > > pages = records.page; > > GENERATE group.$1 AS host, group.$0 AS > > domain, PAGE_COUNT(pages) AS page_count:long; > > }; > > > > Anyone idea what's going on behind the scenes? > > > > Thanks, > > Yen > > >
-
Re: Accumulator is not fired
Thejas Nair 2012-03-16, 01:34
Hi Yen, Does the function also implement Algebraic ? In that case it might end up using the algebraic interface of the udf. If your foreach statement has functions that don't implement Accumulator interface, then reduce task won't run in accumulative mode. This is because you are anyway going to load the whole bag into memory.
If the query is using accumulator mode, you would see this log message - INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AccumulatorOptimizer - Reducer is to run in accumulative mode. I tried modifying your query to -
stats = FOREACH grpd { pages = records.page; GENERATE group.$1 AS host, group.$0 AS domain, COUNT(pages) AS page_count:long; };
and ran it disabling the combiner - bin/pig -Dpig.exec.nocombiner=true -x local -e 'explain -script /tmp/t.pig;'
I was able to verify that it would run Accumulator mode, using the above log message.
Thanks, Thejas On 3/13/12 12:22 PM, Yen SYU wrote: > Hi Jon, > > Thanks for your reponse! I use pig 0.9.1-snapshot. > > I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired. > Also tried to remove generic type in accumulator but it did not help. :( > > Is it easy for you to fire accumulator? > > Yen > > On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney<[EMAIL PROTECTED]>wrote: > >> What version of pig are you using? >> >> just as an experiment in the simple case, can you try doing >> >> GENERATE flatten(group) as (domain,host), ...(the rest)... >> >> shouldn't make a difference, but I think I remember that in some older >> versions it did >> >> 2012/3/13 Yen SYU<[EMAIL PROTECTED]> >> >>> Hi all, >>> >>> I just test a very simple pig script as following: >>> >>> records = LOAD '$input' AS (hash:chararray, domain:chararray, >>> host:chararray, page:chararray, freq:int); >>> grpd = GROUP records BY (domain, host); >>> stats = FOREACH grpd { >>> hashes = records.hash; >>> uniq_hashes = DISTINCT hashes; >>> pages = records.page; >>> GENERATE group.$1 AS host, group.$0 AS >>> domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS >>> page_count:long, SUM(freq) AS freq:long); >>> }; >>> STORE stats INTO '$output'; >>> >>> where PAGE_COUNT is a customized UDF implementing Accumulator. I add an >>> EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the >>> accumulate method is never called. Even I tried to remove all other >>> built-in UDFs and keep the NESTED FOREACH as simple as: >>> >>> stats = FOREACH grpd { >>> pages = records.page; >>> GENERATE group.$1 AS host, group.$0 AS >>> domain, PAGE_COUNT(pages) AS page_count:long; >>> }; >>> >>> Anyone idea what's going on behind the scenes? >>> >>> Thanks, >>> Yen >>> >> >
-
Re: Accumulator is not fired
Yen SYU 2012-03-16, 14:27
Hi Thejas,
Thank you for your reply!
I agree that accumulator mode can be used if you only use built-in UDFs. :)
I noticed what you mentioned in your reply in the past. In my script, PAGE_COUNT is an evaluation function which accumulator is the only interface implemented. I also check built-in UDF COUNT and SUM's source code - it implements both algebraic and accumulator. Theoretically pig should use accumulator mode in this situation.
I also notice sometimes if my pig script is VERY SIMPLE, such as without extracting a lot of tuples/fields in nested foreach, the PAGE_COUNT can be called in accumulator mode, but not in all situations.
I'm really curious if there are other situations that can break accumulator call - or do I nedd to manually turn on some optimization for pig to fire accumulator? Cannot find a lot of relative resource online...
Best, Yen
On Thu, Mar 15, 2012 at 9:34 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
> Hi Yen, > Does the function also implement Algebraic ? In that case it might end up > using the algebraic interface of the udf. > If your foreach statement has functions that don't implement Accumulator > interface, then reduce task won't run in accumulative mode. This is because > you are anyway going to load the whole bag into memory. > > If the query is using accumulator mode, you would see this log message - > INFO org.apache.pig.backend.hadoop.**executionengine.**mapReduceLayer.**AccumulatorOptimizer > - Reducer is to run in accumulative mode. > > > I tried modifying your query to - > > > stats = FOREACH grpd { > pages = records.page; > GENERATE group.$1 AS host, group.$0 AS > domain, COUNT(pages) AS page_count:long; > }; > > and ran it disabling the combiner - > bin/pig -Dpig.exec.nocombiner=true -x local -e 'explain -script > /tmp/t.pig;' > > I was able to verify that it would run Accumulator mode, using the above > log message. > > Thanks, > Thejas > > > > > > On 3/13/12 12:22 PM, Yen SYU wrote: > >> Hi Jon, >> >> Thanks for your reponse! I use pig 0.9.1-snapshot. >> >> I've used FLATTEN instead of $0 and $1, but ACCUM_CALL is still not fired. >> Also tried to remove generic type in accumulator but it did not help. :( >> >> Is it easy for you to fire accumulator? >> >> Yen >> >> On Tue, Mar 13, 2012 at 3:06 PM, Jonathan Coveney<[EMAIL PROTECTED]>** >> wrote: >> >> What version of pig are you using? >>> >>> just as an experiment in the simple case, can you try doing >>> >>> GENERATE flatten(group) as (domain,host), ...(the rest)... >>> >>> shouldn't make a difference, but I think I remember that in some older >>> versions it did >>> >>> 2012/3/13 Yen SYU<[EMAIL PROTECTED]> >>> >>> Hi all, >>>> >>>> I just test a very simple pig script as following: >>>> >>>> records = LOAD '$input' AS (hash:chararray, domain:chararray, >>>> host:chararray, page:chararray, freq:int); >>>> grpd = GROUP records BY (domain, host); >>>> stats = FOREACH grpd { >>>> hashes = records.hash; >>>> uniq_hashes = DISTINCT hashes; >>>> pages = records.page; >>>> GENERATE group.$1 AS host, group.$0 AS >>>> domain, COUNT(uniq_hashes) AS hash_total:long, PAGE_COUNT(pages) AS >>>> page_count:long, SUM(freq) AS freq:long); >>>> }; >>>> STORE stats INTO '$output'; >>>> >>>> where PAGE_COUNT is a customized UDF implementing Accumulator. I add an >>>> EXEC_CALL and ACCUM_CALL counter in this UDF and it looks that the >>>> accumulate method is never called. Even I tried to remove all other >>>> built-in UDFs and keep the NESTED FOREACH as simple as: >>>> >>>> stats = FOREACH grpd { >>>> pages = records.page; >>>> GENERATE group.$1 AS host, group.$0 AS >>>> domain, PAGE_COUNT(pages) AS page_count:long; >>>> }; >>>> >>>> Anyone idea what's going on behind the scenes?
|
|