I have been thinking about using Pig statistics to estimate the number of
reducers. Though the current heuristic approach works fine, it requires an
admin or the programmer to understand what the best number would be for the
job.
We are seeing a large number of jobs over-utilizing resources, and there is
clearly no single default that works well for all kinds of Pig scripts. Some
non-technical users find it difficult to estimate the best number of
reducers for their jobs.
It would be great if we could use stats from previous runs of a job to set
the number of reducers for future runs. This would be a nice feature for
jobs running in production, where neither the job nor the dataset size
fluctuates much.
1. Set a config param in the script
- set script.unique.id prashant.1111222111.demo_script
2. If the above is not set, we fall back on the current implementation
3. If the above is set
    - At the end of the job, persist PigStats (namely Reduce Shuffle
Bytes) to the FS (HDFS, local, S3, ...); let's call this "stats_on_hdfs"
    - Read "stats_on_hdfs" from previous runs and, based on the stats
read, compute an average number of reducers for the current run
    - If no "stats_on_hdfs" exists, we fall back on the current
implementation
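The estimation step could be sketched roughly as follows. This is only an
illustration of the idea, not Pig code: the stats layout and field name
"reduce_shuffle_bytes" are assumptions, and the 1 GB bytes-per-reducer
constant mirrors Pig's default pig.exec.reducers.bytes.per.reducer.

```python
# Hypothetical sketch: average reduce-shuffle bytes over previous runs,
# then size the reducer count from that average.
BYTES_PER_REDUCER = 1_000_000_000  # Pig's default, 1 GB per reducer

def estimate_reducers(stats_on_hdfs, fallback):
    """stats_on_hdfs: list of per-run stats dicts (illustrative layout)."""
    if not stats_on_hdfs:
        # Last bullet of step 3: no prior stats, use the current heuristic.
        return fallback
    avg_shuffle_bytes = sum(
        s["reduce_shuffle_bytes"] for s in stats_on_hdfs
    ) / len(stats_on_hdfs)
    # Ceiling division so even a small job gets at least one reducer.
    return max(1, int(-(-avg_shuffle_bytes // BYTES_PER_REDUCER)))
```

For example, two prior runs that shuffled 3 GB and 5 GB would average to
4 GB and yield 4 reducers, while a job with no history keeps the heuristic's
answer.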
It would be advisable not to retain the stats for too long, and Pig can
make sure to clean up old files that are no longer required.
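The cleanup could be as simple as an age-based prune over the per-script
stats directory. Again a hedged sketch: the one-file-per-run layout and the
30-day window are assumptions, not anything Pig defines today.

```python
import os
import time

def prune_old_stats(stats_dir, max_age_days=30):
    """Delete stats files older than the retention window (illustrative)."""
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(stats_dir):
        path = os.path.join(stats_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
```

On HDFS this would go through the FileSystem API rather than os, but the
retention logic would be the same.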
What do you guys think?