We do basically what you're describing. Each of our scripts has a logical
name which defines the workflow. For each job in the workflow we persist
the job stats, counters and conf in HBase via an implementation of
PigProgressNotificationListener. We can then correlate jobs in a run of the
workflow together based on the pig.script.start.time and pig.job.start time
properties. We use the logical plan script signature to determine the
script version has changed.
During job execution we query the service in a impl of PigReducerEstimator
for matching workflows.
One simple estimation algo is to multiply Pig's default estimated reducers
(which are based on mapper input bytes) by the ratio of mapper output bytes
over mapper input bytes of previous runs. The same could also be done with
slot time, but we haven't tried that yet.
We plan to open source parts of this at some point.
On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:
> I have been thinking about using Pig statistics for # reducers estimation.
> Though the current heuristic approach
> works fine, it requires an admin or the programmer to understand what the
> best number would be for the job.
> We are seeing a large number of jobs over-utilizing resources, and there is
> obviously no default number that works well
> for all kinds of pig scripts. A few non-technical users find it difficult
> to estimate the best # for their jobs.
> It would be great if we can use stats from previous runs of a job to set
> the number
> of reducers for future runs.
> This would be a nice feature for jobs running in production, where the job
> or the dataset size does not fluctuate
> a huge deal.
> 1. Set a config param in the script
> - set script.unique.id prashant.1111222111.demo_script
> 2. If the above is not set, we fallback on the current implementation
> 3. If the above is set
> - At the end of the job, persist PigStats (namely Reduce Shuffle
> Bytes) to FS (hdfs, local, s3....). This would be
> - lets call this stats_on_hdfs
> - Read "stats_on_hdfs" for previous runs, and based on the number of
> such stats to read (based on
> script.reducer.estimation.num.stats) calculate
> an average number of reducers for the current run.
> - If no stats_on_hdfs exists, we fallback on current implementation
> It will be advised to not keep the retention of stats too long, and Pig can
> make sure to clear up old files that are not required.
> What do you guys think?
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*