Yang 2013-04-11, 22:13
A couple of points: grouping z with ALL creates exactly one input group for
the reducers. Since there's only one group, adding more reducers doesn't help. There
are accumulator and algebraic UDFs, but SIZE is not one of them because
SIZE can also take data types other than bags (you can't split the
computation of the SIZE of a chararray, for example). Since you're using it
for a bag, the builtin UDF 'COUNT' is
a much more scalable approach. It will do some aggregation in the combiner
and scale (much, much) better.
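
For reference, a rough sketch of what the rewritten script might look like. The
LOAD line, its path 'input_data', and the single-column schema are placeholders
I made up, since the original message doesn't show how z is built; only the
last two statements mirror the script in the question.

```pig
-- Hypothetical input: the path and schema are assumptions for illustration only.
z = LOAD 'input_data' AS (line:chararray);

-- GROUP ... ALL still produces a single group, so there is still one reducer.
y = GROUP z ALL;

-- COUNT is algebraic: partial counts are computed map-side and in the combiner,
-- so the lone reducer only has to add up a handful of partial counts.
x = FOREACH y GENERATE COUNT(z);

DUMP x;
```

One thing to watch: COUNT skips tuples whose first field is null, whereas SIZE
on a bag counts every tuple. If z can contain such tuples and they should be
counted, COUNT_STAR is the closer match.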
On Thu, Apr 11, 2013 at 3:13 PM, Yang <[EMAIL PROTECTED]> wrote:
> I set default_parallel=15
> but when I did a
> y = group z ALL;
> x = foreach y generate SIZE(z);
> the 2 lines generate a MR job with only 1 reducer.
> I guess it's because SIZE() needs to count all the groups. but don't we
> have the sort of cumulative/additive UDFs ?
> it would be faster if we could parallelize SIZE()