Pig >> mail # user >> SIZE() always leads to 1 reducer?


Yang 2013-04-11, 22:13
Re: SIZE() always leads to 1 reducer?
Hi Yang,

A couple of points: grouping z with ALL creates exactly one input group for
the reducers. Since there's only one group, adding reducers doesn't help. There
are accumulator and algebraic UDFs, but SIZE is not one of them because
SIZE can also take data types other than bags (you can't split the
computation of the SIZE of a chararray, for example). Since you're using it
for a bag, the builtin UDF 'COUNT' (
http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/builtin/COUNT.html) is
a much more scalable approach. It will do some aggregation in the combiner
and scale (much, much) better.
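
For illustration, the swap might look like the sketch below. The LOAD path 'input_data' is a placeholder assumption (the original message doesn't show how z is loaded); the relation names follow the quoted script.

```pig
-- Sketch only: 'input_data' is a hypothetical path, not from the original script.
SET default_parallel 15;

z = LOAD 'input_data';
y = GROUP z ALL;

-- COUNT is algebraic, so partial counts can be computed in the combiner
-- before the single group reaches the reducer.
x = FOREACH y GENERATE COUNT(z);

DUMP x;
```

The GROUP ALL still produces a single group, so the final sum still runs in one reducer; the gain is that it only has to add up partial counts. Note that COUNT skips tuples whose first field is null, so if you need the same result as SIZE on a bag that may contain such tuples, COUNT_STAR is likely the closer match.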

Thanks,
Mark

On Thu, Apr 11, 2013 at 3:13 PM, Yang <[EMAIL PROTECTED]> wrote:

> I set default_parallel=15
>
> but when I did a
>
> y = group z ALL;
> x = foreach y generate SIZE(z);
>
> the 2 lines generate a MR job with only 1 reducer.
>
>
> I guess it's because SIZE() needs to count all the groups, but don't we
> have some sort of cumulative/additive UDFs?
>
>
> it would be faster if we could parallelize SIZE()
>
> thanks
> Yang
>