Hadoop, mail # user - phases of Hadoop Jobs


Nan Zhu 2011-09-19, 02:24
He Chen 2011-09-19, 03:42
Arun C Murthy 2011-09-19, 04:17
Kai Voigt 2011-09-19, 04:23
Arun C Murthy 2011-09-19, 04:26
GOEKE, MATTHEW 2011-09-19, 19:19
Re: phases of Hadoop Jobs
Nan Zhu 2011-09-19, 05:01
Hi Arun,

Thanks!

As you explained, Hadoop does not explicitly divide a job into two phases
(map and reduce). Only for an individual reduce task can we tell which
stage it is in (shuffle, sort, or reduce), and from 0.23 onwards we can do
the same for map tasks.

Is that right?

Nan
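
To make the progress ranges concrete, here is a minimal Java sketch that
guesses a task's phase from its progress value, using the approximate
percentages Arun gives in the quoted message. The class PhaseFromProgress,
its enum, and the exact cut-off points (1/3, 2/3, 0.9) are assumptions for
illustration; this is not Hadoop code, and Hadoop tracks the real phase in
o.a.h.mapred.TaskStatus.Phase.

```java
// Hypothetical helper, NOT part of Hadoop: guesses a task's phase from its
// progress value using the approximate ranges quoted in this thread.
final class PhaseFromProgress {

    // Illustrative names only; Hadoop's own enum is
    // org.apache.hadoop.mapred.TaskStatus.Phase.
    enum Phase { MAP, SHUFFLE, SORT, REDUCE }

    // Reduce task: first third is shuffle, second third is sort (really a
    // merge), last third is reduce. Exact boundaries are an assumption.
    static Phase reducePhase(float progress) {
        checkRange(progress);
        if (progress < 1.0f / 3.0f) {
            return Phase.SHUFFLE;
        }
        if (progress < 2.0f / 3.0f) {
            return Phase.SORT;
        }
        return Phase.REDUCE;
    }

    // Map task, 0.23 onwards as described in the thread: roughly the first
    // 90% is map, the remainder is the final sort/merge.
    static Phase mapPhase(float progress) {
        checkRange(progress);
        return progress < 0.9f ? Phase.MAP : Phase.SORT;
    }

    // Progress is expected as a fraction between 0.0 and 1.0 inclusive.
    private static void checkRange(float progress) {
        if (!(progress >= 0.0f && progress <= 1.0f)) {
            throw new IllegalArgumentException("progress must be in [0, 1]: " + progress);
        }
    }

    public static void main(String[] args) {
        System.out.println("reduce at 20%: " + reducePhase(0.20f));
        System.out.println("reduce at 50%: " + reducePhase(0.50f));
        System.out.println("map at 95%:    " + mapPhase(0.95f));
    }
}
```

Note that this only describes a single task. A job-level "phase" would still
have to be derived from the states of all its tasks, since maps and reduces
can run at the same time.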

On Mon, Sep 19, 2011 at 12:17 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> Nan,
>
>  The 'phase' is implied by the 'progress' value reported by the
> map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
>
>  For example:
>  Reduce:
>  0-33% -> Shuffle
>  34-66% -> Sort (actually, just 'merge', there is no sort in the reduce
> since all map-outputs are sorted)
>  67-100% -> Reduce
>
>  With 0.23 onwards the Map has phases too:
>  0-90% -> Map
>  91-100% -> Final Sort/merge
>
>  Now, about starting reduces early: this is done so that the shuffle can
> proceed for completed maps while the rest of the maps run, thereby pipelining
> shuffle and map completion. There is a 'reduce slowstart' feature to control
> this; by default, reduces aren't started until 5% of maps are complete.
> Users can set this higher.
>
> Arun
>
> On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
>
> > Hi all,
> >
> > Recently I was stumped by a question: "How is a Hadoop job divided into
> > 2 phases?"
> >
> > Textbooks tell us that MapReduce jobs are divided into 2 phases, map and
> > reduce, and that reduce is further divided into 3 stages: shuffle, sort,
> > and reduce. In the Hadoop code, though, I had never thought about this
> > question, and I don't see any member variable in the JobInProgress class
> > that records this information.
> >
> > Also, from my reading of the Hadoop source code, reduce tasks do not
> > necessarily wait until all mappers have finished; on the contrary, we can
> > see reduce tasks in the shuffle stage while some mappers are still
> > running. So how can I tell which phase the job is in?
> >
> > Thanks
> > --
> > Nan Zhu
> > School of Electronic, Information and Electrical Engineering,229
> > Shanghai Jiao Tong University
> > 800,Dongchuan Road,Shanghai,China
> > E-Mail: [EMAIL PROTECTED]
>
>
--
Nan Zhu
School of Electronic, Information and Electrical Engineering,229
Shanghai Jiao Tong University
800,Dongchuan Road,Shanghai,China
E-Mail: [EMAIL PROTECTED]
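
The 'reduce slowstart' threshold Arun mentions can be illustrated with a
little arithmetic. The sketch below is a hedged illustration, not Hadoop's
scheduler code: the class ReduceSlowstart and its method names are made up,
and rounding the threshold up to a whole number of maps is an assumption.
As far as I know, releases of that era expose the fraction through the
mapred.reduce.slowstart.completed.maps property, but check the
mapred-default.xml of your version before relying on that name.

```java
// Hypothetical illustration, NOT Hadoop's scheduler: shows how a slowstart
// fraction translates into "how many maps must finish before reduces start".
final class ReduceSlowstart {

    // Default fraction quoted in the thread: 5% of maps.
    static final double DEFAULT_SLOWSTART = 0.05;

    // Number of completed maps required before reduces may be scheduled.
    // Rounding up is an assumption made for this sketch.
    static int mapsNeededBeforeReduces(int totalMaps, double slowstart) {
        if (totalMaps < 0) {
            throw new IllegalArgumentException("totalMaps must be >= 0");
        }
        if (!(slowstart >= 0.0 && slowstart <= 1.0)) {
            throw new IllegalArgumentException("slowstart must be in [0, 1]");
        }
        return (int) Math.ceil(slowstart * totalMaps);
    }

    // True once enough maps have completed for reduces to begin shuffling.
    static boolean canStartReduces(int completedMaps, int totalMaps, double slowstart) {
        return completedMaps >= mapsNeededBeforeReduces(totalMaps, slowstart);
    }

    public static void main(String[] args) {
        int totalMaps = 30;
        System.out.println("maps needed with default slowstart: "
                + mapsNeededBeforeReduces(totalMaps, DEFAULT_SLOWSTART));
        System.out.println("maps needed with slowstart 1.0:     "
                + mapsNeededBeforeReduces(totalMaps, 1.0));
    }
}
```

Setting the fraction to 1.0 in this sketch means reduces wait for every map
to finish, which gives the strict "map, then reduce" ordering from the
textbooks at the cost of losing the shuffle/map overlap.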
He Chen 2011-09-19, 05:29
Kai Voigt 2011-09-19, 05:36
He Chen 2011-09-19, 06:20
He Chen 2011-09-19, 06:24
Kai Voigt 2011-09-19, 06:33