Pig dev mailing list: Reducer estimation


Re: Reducer estimation
You could store stats for the last X runs and take an average for the next run.

Sent from my iPhone
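A minimal sketch of that last-X-runs average in plain Java; the class name is made up, and it assumes the per-run reduce shuffle byte counts have already been loaded into a list:

import java.util.List;

// Illustrative sketch: average the reduce shuffle bytes of the last X runs
// and turn that average into a reducer count via a bytes-per-reducer target.
public class LastRunsAverage {

    // Pig's pig.exec.reducers.bytes.per.reducer defaults to 1GB.
    private static final long BYTES_PER_REDUCER = 1000L * 1000L * 1000L;

    public static int estimateReducers(List<Long> lastShuffleBytes) {
        if (lastShuffleBytes.isEmpty()) {
            return -1; // no history yet; caller falls back to the default heuristic
        }
        long sum = 0;
        for (long bytes : lastShuffleBytes) {
            sum += bytes;
        }
        long avg = sum / lastShuffleBytes.size();
        // Round up so a small but non-zero average still gets one reducer.
        return (int) Math.max(1L, (avg + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }
}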

On Dec 6, 2012, at 8:09 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> How would flat files work? The data needs to be updated by every pig run.
>
> On Dec 3, 2012, at 11:10 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
>> Awesome! It would be good to have a flat-file based impl, as there will
>> probably be a lot of Pig users who don't have an HBase instance set up for
>> stats persistence. Let me know if I can help in any way.
>>
>> Is there a timeframe you are looking at for open-sourcing this?
>>
>>
>> On Dec 4, 2012, at 12:32 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>
>>> We do basically what you're describing. Each of our scripts has a logical
>>> name which defines the workflow. For each job in the workflow we persist
>>> the job stats, counters and conf in HBase via an implementation of
>>> PigProgressNotificationListener. We can then correlate jobs in a run of the
>>> workflow together based on the pig.script.start.time and pig.job.start.time
>>> properties. We use the logical plan script signature to determine whether the
>>> script version has changed.
>>>
>>> During job execution we query the service in an impl of PigReducerEstimator
>>> for matching workflows.
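For users without an HBase instance, a flat-file variant of this persistence step might look roughly like the sketch below; WorkflowStatsStore and its tab-separated layout are illustrative only, with record(...) intended to be driven from the job-finished callback of a PigProgressNotificationListener implementation:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Illustrative flat-file stats store: one line per finished job, keyed by
// workflow name, script signature and script start time so that jobs in a
// run of the workflow can be correlated later.
public class WorkflowStatsStore {

    private final String statsFile;

    public WorkflowStatsStore(String statsFile) {
        this.statsFile = statsFile;
    }

    // Invoke from the job-finished callback of a
    // PigProgressNotificationListener implementation.
    public synchronized void record(String workflow, String scriptSignature,
                                    long scriptStartTime, String jobId,
                                    long mapInputBytes, long mapOutputBytes)
            throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(statsFile, true))) {
            out.printf("%s\t%s\t%d\t%s\t%d\t%d%n", workflow, scriptSignature,
                    scriptStartTime, jobId, mapInputBytes, mapOutputBytes);
        }
    }
}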
>>>
>>> One simple estimation algo is to multiply Pig's default estimated reducers
>>> (which are based on mapper input bytes) by the ratio of mapper output bytes
>>> over mapper input bytes of previous runs. The same could also be done with
>>> slot time, but we haven't tried that yet.
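A sketch of that arithmetic, assuming the previous-run byte counts come out of the persisted stats; the constants mirror Pig's pig.exec.reducers.bytes.per.reducer (1GB) and pig.exec.reducers.max (999) defaults:

// Illustrative sketch: scale Pig's input-size heuristic by the mapper
// output/input byte ratio observed in previous runs of the same workflow.
public class RatioEstimate {

    private static final long BYTES_PER_REDUCER = 1000L * 1000L * 1000L;
    private static final int MAX_REDUCERS = 999;

    public static int estimate(long mapInputBytes,
                               long prevMapOutputBytes, long prevMapInputBytes) {
        // Pig's default estimate, driven by mapper input bytes alone.
        long defaultEstimate = Math.min(MAX_REDUCERS,
                Math.max(1L, (long) Math.ceil((double) mapInputBytes / BYTES_PER_REDUCER)));
        // How much the mappers shrank (or grew) the data last time around.
        double ratio = (double) prevMapOutputBytes / prevMapInputBytes;
        return (int) Math.min(MAX_REDUCERS, Math.max(1L, Math.round(defaultEstimate * ratio)));
    }
}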
>>>
>>> We plan to open source parts of this at some point.
>>>
>>>
>>> On Mon, Dec 3, 2012 at 10:32 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>>>
>>>> I have been thinking about using Pig statistics to estimate the number of
>>>> reducers. Though the current heuristic approach works fine, it requires an
>>>> admin or the programmer to understand what the best number would be for the
>>>> job.
>>>> We are seeing a large number of jobs over-utilizing resources, and there is
>>>> obviously no default number that works well
>>>> for all kinds of pig scripts. A few non-technical users find it difficult
>>>> to estimate the best # for their jobs.
>>>> It would be great if we could use stats from previous runs of a job to set
>>>> the number of reducers for future runs.
>>>>
>>>> This would be a nice feature for jobs running in production, where the job
>>>> or the dataset size does not fluctuate much.
>>>>
>>>>
>>>> 1. Set a config param in the script
>>>>    - set script.unique.id prashant.1111222111.demo_script
>>>> 2. If the above is not set, we fall back on the current implementation
>>>> 3. If the above is set
>>>>    - At the end of the job, persist PigStats (namely Reduce Shuffle
>>>>    Bytes) to FS (hdfs, local, s3....). This would be
>>>> "${script.unique.id}_YYYYMMDDHHmmss"
>>>>    - let's call this stats_on_hdfs
>>>>    - Read "stats_on_hdfs" for previous runs and, based on the number of
>>>>    such stats to read (script.reducer.estimation.num.stats), calculate
>>>>    an average number of reducers for the current run.
>>>>    - If no stats_on_hdfs exists, we fall back on the current implementation
>>>>
>>>> It would be advisable not to retain stats for too long, and Pig can make
>>>> sure to clean up old files that are no longer required.
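A minimal sketch of the read-and-average side of step 3 against a local directory (an HDFS or S3 version would go through Hadoop's FileSystem API instead of java.io); it assumes each stats file is named "${script.unique.id}_YYYYMMDDHHmmss" as above and contains a single number, that run's reduce shuffle bytes:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;

// Illustrative sketch: locate the stats files for a script.unique.id, average
// the reduce shuffle bytes of the most recent runs, fall back if none exist.
public class StatsReader {

    public static long averageShuffleBytes(File statsDir, String scriptUniqueId,
                                           int numStats) throws IOException {
        File[] files = statsDir.listFiles(
                (dir, name) -> name.startsWith(scriptUniqueId + "_"));
        if (files == null || files.length == 0) {
            return -1; // no stats_on_hdfs; fall back on the current implementation
        }
        Arrays.sort(files); // the YYYYMMDDHHmmss suffix sorts oldest-to-newest
        long sum = 0;
        int used = 0;
        for (int i = files.length - 1; i >= 0 && used < numStats; i--, used++) {
            sum += Long.parseLong(
                    new String(Files.readAllBytes(files[i].toPath())).trim());
        }
        // Files older than the retention window could be deleted here.
        return sum / used;
    }
}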
>>>>
>>>> What do you guys think?
>>>>
>>>> -Prashant
>>>>
>>>
>>>
>>>
>>> --
>>> *Note that I'm no longer using my Yahoo! email address. Please email me at
>>> [EMAIL PROTECTED] going forward.*