Pig dev mailing list - Reducer estimation


Re: Reducer estimation
Prashant Kommireddi 2012-12-06, 14:44
You could store stats for the last X runs and take an average for the next run.

Sent from my iPhone
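
A minimal sketch of that averaging idea in Java; how the per-run reducer counts
are loaded is left out, and nothing here is Pig API:

    import java.util.List;

    public class LastRunsAverager {
        // Average the reducer counts of the last X runs; -1 means "no history,
        // fall back to Pig's default estimation".
        public static int averageReducers(List<Integer> reducersPerRun) {
            if (reducersPerRun == null || reducersPerRun.isEmpty()) {
                return -1;
            }
            long sum = 0;
            for (int r : reducersPerRun) {
                sum += r;
            }
            return (int) Math.max(1, sum / reducersPerRun.size());
        }
    }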

On Dec 6, 2012, at 8:09 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> How would flat files work? The data needs to be updated by every pig run.
>
> On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
>> Awesome! It would be good to have a flat-file based impl, as there will
>> probably be a lot of Pig users who don't have an HBase instance set up for
>> stats persistence. Let me know if I can help in any way.
>>
>> Is there a timeframe you are looking at for open-sourcing this?
>>
>>
>> On Dec 4, 2012, at 12:32 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>
>>> We do basically what you're describing. Each of our scripts has a logical
>>> name which defines the workflow. For each job in the workflow we persist
>>> the job stats, counters and conf in HBase via an implementation of
>>> PigProgressNotificationListener. We can then correlate jobs in a run of the
>>> workflow together based on the pig.script.start.time and pig.job.start.time
>>> properties. We use the logical plan script signature to determine whether the
>>> script version has changed.
>>>
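
(For illustration only: one way such rows could be keyed so that every job from a
single run of a workflow groups together; the exact layout below is an assumption,
not necessarily the schema described above.)

    public class StatsRowKey {
        // Sketch of a row key built from the workflow's logical name plus the
        // pig.script.start.time and pig.job.start.time property values.
        public static String rowKey(String workflowName,
                                    String scriptStartTime,
                                    String jobStartTime,
                                    String jobId) {
            return workflowName + ":" + scriptStartTime + ":"
                    + jobStartTime + ":" + jobId;
        }
    }
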
>>> During job execution we query the service in an impl of PigReducerEstimator
>>> for matching workflows.
>>>
>>> One simple estimation algo is to multiply Pig's default estimated reducers
>>> (which are based on mapper input bytes) by the ratio of mapper output bytes
>>> over mapper input bytes of previous runs. The same could also be done with
>>> slot time, but we haven't tried that yet.
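
(A rough sketch of that scaling, with the base estimate mimicking Pig's
input-bytes / bytes-per-reducer heuristic; the parameter values and method names
here are illustrative, not the actual implementation.)

    public class RatioScaledEstimate {
        // Roughly Pig's default: one reducer per bytesPerReducer of mapper input,
        // capped at maxReducers.
        public static int defaultEstimate(long mapInputBytes, long bytesPerReducer,
                                          int maxReducers) {
            int reducers = (int) Math.ceil((double) mapInputBytes / bytesPerReducer);
            return Math.max(1, Math.min(reducers, maxReducers));
        }

        // Scale the default estimate by the average mapper output/input byte ratio
        // observed in previous runs of the same workflow.
        public static int scaledEstimate(long mapInputBytes, long bytesPerReducer,
                                         int maxReducers, double avgOutputToInputRatio) {
            int base = defaultEstimate(mapInputBytes, bytesPerReducer, maxReducers);
            int scaled = (int) Math.ceil(base * avgOutputToInputRatio);
            return Math.max(1, Math.min(scaled, maxReducers));
        }
    }

For example, 10 GB of mapper input at 1 GB per reducer gives a base estimate of 10
reducers; if previous runs show mappers emitting only 30% of their input bytes, the
scaled estimate drops to 3.
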
>>>
>>> We plan to open source parts of this at some point.
>>>
>>>
>>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:
>>>
>>>> I have been thinking about using Pig statistics for # reducers estimation.
>>>> Though the current heuristic approach
>>>> works fine, it requires an admin or the programmer to understand what the
>>>> best number would be for the job.
>>>> We are seeing a large number of jobs over-utilizing resources, and there is
>>>> obviously no default number that works well
>>>> for all kinds of pig scripts. A few non-technical users find it difficult
>>>> to estimate the best # for their jobs.
>>>> It would be great if we could use stats from previous runs of a job to set
>>>> the number of reducers for future runs.
>>>>
>>>> This would be a nice feature for jobs running in production, where the job
>>>> or the dataset size does not fluctuate much.
>>>>
>>>>
>>>> 1. Set a config param in the script
>>>>    - set script.unique.id prashant.1111222111.demo_script
>>>> 2. If the above is not set, we fall back on the current implementation
>>>> 3. If the above is set
>>>>    - At the end of the job, persist PigStats (namely Reduce Shuffle Bytes)
>>>>      to FS (hdfs, local, s3....). This would be
>>>>      "${script.unique.id}_YYYYMMDDHHmmss" - let's call this stats_on_hdfs
>>>>    - Read "stats_on_hdfs" for previous runs and, based on the number of
>>>>      such stats to read (script.reducer.estimation.num.stats), calculate
>>>>      an average number of reducers for the current run.
>>>>    - If no stats_on_hdfs exists, we fall back on the current implementation
>>>>
>>>> It would be advisable not to retain the stats for too long, and Pig can
>>>> make sure to clean up old files that are no longer required.
>>>>
>>>> What do you guys think?
>>>>
>>>> -Prashant
>>>>
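
(A minimal sketch of the flat-file flow proposed above, written against the local
filesystem for brevity; an HDFS version would go through Hadoop's FileSystem API
instead. The stats directory, file contents and the bytes-per-reducer divisor are
assumptions, not part of the proposal.)

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.text.SimpleDateFormat;
    import java.util.*;

    public class FlatFileReducerStats {
        private final Path statsDir;
        private final String scriptUniqueId;   // value of script.unique.id

        public FlatFileReducerStats(Path statsDir, String scriptUniqueId) {
            this.statsDir = statsDir;
            this.scriptUniqueId = scriptUniqueId;
        }

        // After a successful run: write "${script.unique.id}_yyyyMMddHHmmss"
        // containing the run's reduce shuffle bytes.
        public void persist(long reduceShuffleBytes) throws IOException {
            String ts = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
            Files.createDirectories(statsDir);
            Files.write(statsDir.resolve(scriptUniqueId + "_" + ts),
                    Long.toString(reduceShuffleBytes).getBytes(StandardCharsets.UTF_8));
        }

        // Before the next run: average the last numStats runs and derive a reducer
        // count; -1 means "no stats_on_hdfs yet, fall back to the current impl".
        public int estimateReducers(int numStats, long bytesPerReducer) throws IOException {
            List<Path> files = new ArrayList<>();
            try (DirectoryStream<Path> ds =
                         Files.newDirectoryStream(statsDir, scriptUniqueId + "_*")) {
                for (Path p : ds) files.add(p);
            } catch (NoSuchFileException e) {
                return -1;
            }
            if (files.isEmpty()) return -1;
            Collections.sort(files);   // timestamp suffix sorts lexicographically
            List<Path> recent = files.subList(Math.max(0, files.size() - numStats),
                    files.size());
            long total = 0;
            for (Path p : recent) {
                total += Long.parseLong(
                        new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim());
            }
            return (int) Math.max(1, (total / recent.size()) / bytesPerReducer);
        }
    }

Retention could then just mean deleting all but the newest
script.reducer.estimation.num.stats files for a given script.unique.id.
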
>>>
>>>
>>>
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>>> [EMAIL PROTECTED] going forward.*