Pig >> mail # dev >> Reducer estimation

I have been thinking about using Pig statistics to estimate the number of
reducers. Though the current heuristic approach works fine, it requires an
admin or the programmer to know the best number for the job.
We are seeing a large number of jobs over-utilizing resources, and there is
obviously no single default that works well for all kinds of Pig scripts.
Some non-technical users find it difficult to estimate the best number for
their jobs. It would be great if we could use stats from previous runs of a
job to set the number of reducers for future runs.

This would be a nice feature for jobs running in production, where neither
the job nor the dataset size fluctuates much.
   1. Set a config param in the script:
      - set script.unique.id prashant.1111222111.demo_script
   2. If the above is not set, fall back on the current implementation.
   3. If the above is set:
      - At the end of the job, persist PigStats (namely Reduce Shuffle
      Bytes) to the FS (HDFS, local, S3, ...); let's call this
      "stats_on_hdfs".
      - Read "stats_on_hdfs" from previous runs and, based on the number of
      such stats to read (controlled by
      script.reducer.estimation.num.stats), calculate an average number of
      reducers for the current run.
      - If no "stats_on_hdfs" exists, fall back on the current
      implementation.
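The estimation step above could be sketched roughly as follows. This is a
hypothetical illustration, not Pig code: the class name, the method, and the
bytes-per-reducer constant (modeled on Pig's pig.exec.reducers.bytes.per.reducer
default) are all assumptions; the list of shuffle-bytes values stands in for
what would be read from "stats_on_hdfs".

```java
import java.util.List;

// Hypothetical sketch: average Reduce Shuffle Bytes over the last N runs
// (N = script.reducer.estimation.num.stats) and derive a reducer count,
// falling back to the current heuristic when no stats exist.
public class StatsBasedReducerEstimator {

    // Assumption: modeled on pig.exec.reducers.bytes.per.reducer (1 GB default).
    static final long BYTES_PER_REDUCER = 1_000_000_000L;

    // shuffleBytesHistory: Reduce Shuffle Bytes from previous runs, as read
    // from "stats_on_hdfs". Returns the estimated parallelism.
    static int estimateReducers(List<Long> shuffleBytesHistory, int fallback) {
        if (shuffleBytesHistory.isEmpty()) {
            return fallback; // no stats_on_hdfs -> current implementation
        }
        long sum = 0;
        for (long bytes : shuffleBytesHistory) {
            sum += bytes;
        }
        long avg = sum / shuffleBytesHistory.size();
        // Ceiling division: one reducer per BYTES_PER_REDUCER, at least one.
        return (int) Math.max(1, (avg + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }
}
```

Averaging over several runs rather than taking only the last one would
smooth over a single unusually small or large input.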

It would be advisable not to retain stats for too long, and Pig can make
sure to clean up old files that are no longer required.
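The cleanup could look something like the sketch below. Again purely
illustrative: the class and method names are made up, and it uses the local
filesystem via java.io.File as a stand-in for the HDFS/S3 FileSystem API.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical retention sketch: keep only the `keep` most recent stats
// files for a script and delete the rest.
public class StatsRetention {

    // Returns the number of files deleted.
    static int pruneOldStats(File statsDir, int keep) {
        File[] files = statsDir.listFiles();
        if (files == null || files.length <= keep) {
            return 0; // nothing to prune
        }
        // Sort newest first by modification time.
        Arrays.sort(files, Comparator.comparingLong(File::lastModified).reversed());
        int deleted = 0;
        for (int i = keep; i < files.length; i++) {
            if (files[i].delete()) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

Running this at the end of each job, with `keep` tied to
script.reducer.estimation.num.stats, would bound the storage used per script.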

What do you guys think?