Significance of file.out.index during Shuffle Phase ?
Pavan Kulkarni 2012-08-19, 03:35
Hi,
I am trying to understand how exactly each reducer finds out how to fetch the data for its own partition from the map nodes. During the execution of a MapReduce job, I see that *file.out* is created on the map nodes, so my question is: how does a reducer know which part of file.out to fetch? Does *file.out.index* play any role? Any help is appreciated. Thanks.
--With Regards, Pavan Kulkarni
Harsh J 2012-08-19, 11:02
Re: Significance of file.out.index during Shuffle Phase ?
Pavan Kulkarni 2012-08-19, 15:57
Oh, thanks a lot, Harsh. That is exactly what I was looking for.
I want to create a different file.out for each reducer, something like file.out.1 for reducer 1, file.out.2 for reducer 2, and so on. Is it possible to do this from within the MapReduce program, or do I need to modify some Hadoop source files for that? Thanks.
On Sun, Aug 19, 2012 at 7:02 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Hey Pavan,
>
> Yes, you've got it almost right on how file.out is served to each reducer. See the code at
> http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java?view=markup
> (the method at L502:L565, which sends data for a specific reduce/partition ID, an integer).
>
> --
> Harsh J
--With Regards, Pavan Kulkarni
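For readers following the thread, here is a minimal sketch of the idea Harsh points to: one data file per map holds all partitions back to back, and a small index of fixed-size records tells the server which byte range belongs to a given reduce/partition ID. The record layout used here (start offset, raw length, on-disk length, each an 8-byte long) is an assumption for illustration and should be checked against the Hadoop source for your version; the names `PartitionIndexSketch`, `Range`, `buildIndex`, and `lookup` are hypothetical and are not Hadoop APIs.

```java
import java.nio.ByteBuffer;

// Sketch only: the record layout below is an assumption for illustration,
// not a verified copy of Hadoop's on-disk index format.
class PartitionIndexSketch {
    // One record per partition: start offset, raw length, on-disk length.
    static final int RECORD_BYTES = 3 * Long.BYTES;

    /** Byte range of one partition inside the single map output file. */
    static final class Range {
        final long startOffset, rawLength, partLength;
        Range(long startOffset, long rawLength, long partLength) {
            this.startOffset = startOffset;
            this.rawLength = rawLength;
            this.partLength = partLength;
        }
    }

    /** Build an index for partitions stored back to back (no compression, so raw == on-disk). */
    static ByteBuffer buildIndex(long[] partLengths) {
        ByteBuffer buf = ByteBuffer.allocate(partLengths.length * RECORD_BYTES);
        long offset = 0;
        for (long len : partLengths) {
            buf.putLong(offset).putLong(len).putLong(len);
            offset += len;
        }
        buf.flip();
        return buf;
    }

    /** Find the range for one reduce/partition ID by jumping to its fixed-size record. */
    static Range lookup(ByteBuffer index, int reduceId) {
        int pos = reduceId * RECORD_BYTES;
        return new Range(index.getLong(pos), index.getLong(pos + 8), index.getLong(pos + 16));
    }

    public static void main(String[] args) {
        ByteBuffer index = buildIndex(new long[] {100, 250, 40});
        Range r = lookup(index, 2);
        System.out.println("reducer 2 reads " + r.partLength + " bytes at offset " + r.startOffset);
    }
}
```

With partition lengths of 100, 250, and 40 bytes, the lookup for reducer 2 yields offset 350 and length 40, so the server only has to send that slice of the data file.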
Reply: Significance of file.out.index during Shuffle Phase ?
俞盛朋 2012-08-20, 00:44
The MapReduce program would create an output file for each reducer, named "part-xxxxxx" by default.
Re: Reply: Significance of file.out.index during Shuffle Phase ?
Pavan Kulkarni 2012-08-20, 01:47
Hi,
But I don't see those files during the execution. I only see file.out in the job_ID/attemptID/output/ folder.
--With Regards, Pavan Kulkarni
Reply: Reply: Significance of file.out.index during Shuffle Phase ?
俞盛朋 2012-08-20, 02:41
Oh, sorry, I misunderstood your question. Please forget what I said.
Re: Significance of file.out.index during Shuffle Phase ?
Arun C Murthy 2012-08-20, 02:54
You'll need to make significant changes to MapTask.java, and those changes won't make it back to the mainline.
Why? We had this before and quickly ran out of inodes on the local disk. Think of large jobs with 10,000 maps * 1,000 reduces: that's 10M files.
Arun
--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
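The arithmetic in Arun's message can be written out as a small sketch. It compares one file per (map, reduce) pair with one data file plus one index file per map. The second count is a simplification that assumes only the final file.out and file.out.index are kept and ignores any intermediate spill files; the class and method names are hypothetical.

```java
// Back-of-the-envelope count of local map output files, using the numbers from the message above.
class MapOutputFileCount {
    /** One file per (map, reduce) pair, e.g. file.out.1, file.out.2, ... for every map. */
    static long perReducerFiles(long maps, long reduces) {
        return maps * reduces;
    }

    /** One data file plus one index file per map (assumes spills are already merged). */
    static long singleFilePlusIndex(long maps) {
        return 2 * maps;
    }

    public static void main(String[] args) {
        long maps = 10_000;
        long reduces = 1_000;
        System.out.println(perReducerFiles(maps, reduces)); // 10000000
        System.out.println(singleFilePlusIndex(maps));      // 20000
    }
}
```

Under these assumptions the per-reducer layout needs 500 times as many files across the cluster as the single-file layout, which is the inode pressure Arun describes.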
Re: Significance of file.out.index during Shuffle Phase ?
Pavan Kulkarni 2012-08-21, 03:15
Arun,
Yes, I've got it now. What I am trying to do is store the intermediate data on a shared file system and create hard links to the map outputs (file.out) spilled by the map nodes. This would eliminate the copy phase of the shuffle stage. But now that I have learned that the data for different reducers is partitioned within the same file (file.out), creating hard links wouldn't serve the purpose, would it? Or is there a way to do it? Please correct me if I am wrong in any of my assumptions. Thanks.
--With Regards, Pavan Kulkarni
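One possible reading of the question, offered only as a sketch and not as an answer from the thread: if a reducer could open the whole file.out over a shared file system (through a hard link or otherwise), it would still need the offset and length for its partition, and could then read just that byte range. The example below uses a temporary file as a stand-in for file.out and a hand-written offset and length as a stand-in for an index entry. `SliceReadSketch` and `readSlice` are hypothetical names, and nothing here is Hadoop code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SliceReadSketch {
    /** Read exactly `length` bytes starting at `offset` from a file that holds all partitions. */
    static byte[] readSlice(Path file, long offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos); // positional read; does not move the channel position
                if (n < 0) {
                    throw new IOException("slice runs past end of file");
                }
                pos += n;
            }
            return buf.array();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a map output file: three partitions stored back to back.
        Path file = Files.createTempFile("file", ".out");
        try {
            Files.write(file, "AAAABBBBBBCC".getBytes(StandardCharsets.US_ASCII));
            // Hypothetical index entry for partition 1: offset 4, length 6.
            byte[] slice = readSlice(file, 4, 6);
            System.out.println(new String(slice, StandardCharsets.US_ASCII)); // BBBBBB
        } finally {
            Files.delete(file);
        }
    }
}
```

Whether this fits a real deployment depends on how the index is made available to the reducers, which the thread does not settle.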