On Aug 13, 2012, at 9:05 AM, Benjamin Smedberg wrote:
> I'm a new-ish pig user querying data on an hbase cluster. I have a question about accumulator-style functions.
> When writing an accumulator-style UDF, is all of the data shipped to a single machine before it is reduced/accumulated? For example, if I were going to re-implement SUM as a UDF, it seems to me that it would be more efficient to run SUM on each map node, and then do a sum-of-sums when reducing. Is there a way to write a UDF which supports this style of accumulation/aggregation?
How many reducers are involved in an operation is independent of the type of UDF you use. The number of reducers is determined by the parallelism you declare in your script (via the PARALLEL clause on your GROUP statement or via a set default_parallel statement) or, if you declare neither, by the default Pig chooses.
As to whether it is more efficient to do a sum of sums, it certainly is. For those types of operations you should use an algebraic UDF rather than an accumulator UDF. Algebraic UDFs have initial (map), intermediate (combiner), and final (reducer) steps, so partial results are computed before the data is shipped to a reducer. Accumulator UDFs are for operations that cannot be distributed but that only need to see the data stream once. An example would be cumulative sums, where you want to return not just a final sum but a list of the sums as you went along. This is order dependent and thus can't be done until you've collected all the values for a given key.
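To make the distinction concrete, here is a minimal sketch in plain Python. It is not Pig's Java UDF API: the function names (initial, intermediate, final, cumulative_sums) and the three-way split of the data are hypothetical stand-ins for the phases described above, under the simplifying assumption that each map task sees one slice of the values for a key.

```python
# Hypothetical stand-ins for the three phases of an algebraic function.
def initial(values):
    """Map side: partial sum over the values one map task sees."""
    return sum(values)

def intermediate(partials):
    """Combiner: merge some partial sums; may run any number of times."""
    return sum(partials)

def final(partials):
    """Reducer: merge the remaining partial sums into the result."""
    return sum(partials)

def cumulative_sums(ordered_values):
    """Accumulator-style: one pass over the whole ordered stream for a key."""
    total = 0
    out = []
    for v in ordered_values:
        total += v
        out.append(total)
    return out

# Values for one key, split across three hypothetical map tasks.
splits = [[1, 2, 3], [4, 5], [6]]

# Algebraic path: only small partial sums travel between phases.
partials = [initial(s) for s in splits]
combined = [intermediate(partials[:2]), intermediate(partials[2:])]
algebraic_total = final(combined)

# Accumulator path: every value for the key must be collected, in order.
running = cumulative_sums([v for s in splits for v in s])
```

The sum works as a sum of sums because addition does not care how the values are grouped. The running list does depend on order, which is why it cannot be split across map tasks in the same way.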
> Also, is PigStorage compatible with the quoting expected by excel tab-delimited files? AIUI that would require quoting the values with "value\tvalue" and escaping double quotes. If this isn't the native PigStorage format, is there a storage module already written which supports excel-tab output?
PigStorage doesn't support escaping. I am not aware of a storage function focused on the Excel CSV format, but others may be.
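For reference, the quoting rules in question look roughly like the sketch below. It uses the excel-tab dialect of Python's standard csv module and runs outside Pig; the helper name to_excel_tab and the sample rows are made up for illustration. It only shows what a storage function or post-processing step would have to emit, not anything PigStorage does.

```python
import csv
import io

def to_excel_tab(rows):
    """Render rows as Excel-style tab-delimited text.

    Fields containing a tab, a double quote, or a newline are wrapped in
    double quotes, and embedded double quotes are doubled.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, dialect="excel-tab")
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical sample data: one field with quotes, one with an embedded tab.
rows = [["id", "comment"], ["1", 'said "hi"'], ["2", "has\ttab"]]
text = to_excel_tab(rows)

# Reading the text back with the same dialect recovers the original fields.
parsed = list(csv.reader(io.StringIO(text, newline=""), dialect="excel-tab"))
```

Plain fields are written unquoted, so only values that would otherwise be ambiguous pick up the extra quoting.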