Re: Reduce Task Priority / Scheduler

This makes sense until you realize:

a) It won't scale.

b) Machines fail.

On Dec 20, 2010, at 5:26 AM, Martin Becker wrote:

> I wrote a little bit much, so I put a summary up front. Sorry about that.
>
> Summary:
> 1) Is there any point in time where one single instance of Hadoop has
> access to all keys that are to be distributed to the nodes together
> with the corresponding data? Or, alternatively, could nodes have Task
> priorities, killing and rescheduling tasks if higher-priority tasks
> arrive (keywords: Partitioner, TaskScheduler)?
>
> 2) Blocking running tasks does not get me anywhere if they are not
> suspended to allow other Reducers to take their place, since blocking
> still takes up Reducer slots, doesn't it? The main problem is Reducers
> waiting for a slot.
>
> 3) Why Reducer ordering (could) affect(s) processing.
>
> ad 1)
> 1.1) The only thing the JobTracker would need to do is look at the keys
> and derive some job-internal order of Reduce tasks. At this point it
> would be necessary for the JobTracker (or _any other_ instance which
> would be able to do such a thing!) to know how many Reducers are to
> start for a specific job, what their keys are, or at least what their
> priorities are.
>
> 1.2) At some point the Partitioner distributes keys to nodes, meaning
> it could at least group high-quality with low-quality tasks (based on
> some criterion). Could not, _for example_, the TaskTrackers themselves
> then decide which of the tasks assigned to them by the Partitioner to
> execute first?
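
A minimal sketch of the idea in 1.2, assuming Text keys and IntWritable values
and a hypothetical isHighPriority() test (neither is in the thread); note that
a Partitioner only decides which reduce partition a key lands in, not when that
partition's task actually runs:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys the application considers "high priority" to the lower-numbered
// reduce partitions; everything else is hashed into the remaining partitions.
public class PriorityPartitioner extends Partitioner<Text, IntWritable> {

    // Hypothetical priority test; replace with the job's own criterion.
    private boolean isHighPriority(Text key) {
        return key.toString().startsWith("hot:");
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
            return 0;
        }
        int half = numPartitions / 2;
        int hash = key.hashCode() & Integer.MAX_VALUE;
        return isHighPriority(key)
                ? hash % half
                : half + hash % (numPartitions - half);
    }
}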
>
> 1.3) So the question is basically: is there some instance that CAN do
> some prioritizing of Tasks the way I want it to? Even if it is only
> in combination with the Partitioner: "Oh, a Task with lower priority
> is running. So kill it, and restart it later." Maybe this would work
> using something other than the FairScheduler. I am making wild guesses
> here, but I think I am drifting towards the TaskScheduler, if it
> actually does what I think it does.
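
For what it is worth, the stock schedulers only expose priority at the job
level (old-style mapred API, shown below as a sketch); there is no built-in
per-Reduce-task priority, which is why this keeps coming back to the
Partitioner and the TaskScheduler:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class JobPrioritySketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(JobPrioritySketch.class);
        conf.setJobName("priority-demo");
        // Whole-job knob only; individual Reduce tasks of one job
        // cannot be prioritized against each other this way.
        conf.setJobPriority(JobPriority.HIGH);
        // ... input/output paths, Mapper/Reducer classes, job submission ...
    }
}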
>
> ad 2) Blocking:
> If I have enough slots for all the Reduce tasks, I have no problem at
> all. There is no sense in starting a task and then blocking it.
> Why not let it run? It is not as if the Reducers have to wait for some
> other one to finish. They could simply stop working, or not even start,
> if their output is redundant (see "ad 3)").
>
> ad 3) This is why Reducer ordering affects processing:
> Preliminaries:
> * Each Reducer raises a (global! - using ZooKeeper, FileSystem or
> maybe Counters) threshold.
> * Each Reducer can estimate if it will ever pass a given threshold.
> * Output of Reducers that cannot pass the threshold is discarded.
> * Some Reducers have a higher probability (by Key) of raising the
> threshold faster.
>
> As a result it would make sense to run Reducers with a higher
> probability of raising the threshold first. Reducers can cease their
> work, or not even start, once they can no longer pass the threshold.
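
One way the threshold idea could look in code, as a sketch only: here the
threshold lives in a file on the shared FileSystem (ZooKeeper or counters
would work similarly), the path is hypothetical, and canPass() stands in for
the application-specific estimate mentioned above:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private long globalThreshold = 0L;

    @Override
    protected void setup(Context context) throws IOException {
        // Hypothetical location where the best threshold seen so far is published.
        Path thresholdFile = new Path("/tmp/job-threshold");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        if (fs.exists(thresholdFile)) {
            FSDataInputStream in = fs.open(thresholdFile);
            try {
                globalThreshold = in.readLong();
            } finally {
                in.close();
            }
        }
    }

    // Hypothetical estimate: can this key group still beat the threshold?
    private boolean canPass(long upperBoundForKey) {
        return upperBoundForKey > globalThreshold;
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long upperBoundForKey = Long.MAX_VALUE; // application-specific bound, assumed known
        if (!canPass(upperBoundForKey)) {
            return; // discard work that can no longer pass the threshold
        }
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        if (sum > globalThreshold) {
            context.write(key, new LongWritable(sum));
        }
    }
}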
>
>
> On Mon, Dec 20, 2010 at 11:58 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>> The JobTracker wouldn't know what your data is going to be when it
>> is assigning the Reduce Tasks.
>>
>> If you really do need ordering among your reducers, you should
>> implement a locking mechanism (making sure the dormant reduce tasks
>> stay alive by sending out some status reports).
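
A sketch of the keep-alive part of that suggestion; the lock itself is left
out, and turnHasCome() is a hypothetical stand-in for whatever coordination
mechanism is polled. The point is to keep reporting status so the framework
does not kill the waiting task for exceeding the task timeout:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderedReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    // Hypothetical: poll ZooKeeper, a file on HDFS, etc. for permission to start.
    private boolean turnHasCome(Context context) {
        return true;
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        while (!turnHasCome(context)) {
            context.setStatus("waiting for my turn to reduce");
            context.progress(); // heartbeat so the task is not timed out
            Thread.sleep(10000L);
        }
    }
}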
>>
>> Although, how is ordering going to affect your reducer's processing? :)
>>
>> On Mon, Dec 20, 2010 at 2:37 PM, Martin Becker <[EMAIL PROTECTED]> wrote:
>>> I just reread my first post. Maybe I was not clear enough:
>>> It is only important to me that the Reduce tasks _start_ in a
>>> specified order based on their key. That is the only additional
>>> constraint I need.
>>>
>>> On Mon, Dec 20, 2010 at 9:51 AM, Martin Becker <[EMAIL PROTECTED]> wrote:
>>>> As far as I understood, MapReduce waits for all Mappers to finish
>>>> before it starts running Reduce tasks. Am I mistaken here? If I am not,
>>>> then I do not see any more synchrony being introduced than there