Re: Significance of file.out.index during Shuffle Phase?
Arun C Murthy 2012-08-20, 02:54
You'll need to make significant changes to MapTask.java, which won't make it back to the mainline.
Why? We had this before and quickly ran out of inodes on the local-disk. Think of large jobs with 10,000 maps * 1000 reduces -> that's 10M files.

Arun

On Aug 19, 2012, at 8:57 AM, Pavan Kulkarni wrote:

> Ohh ,Thanks a lot Harsh. Exactly what I was looking for.
> I wanted to create different file.out's for different reducers. Something
> like
> file.out.1 for reducer 1, file.out.2 for reducer etc. Is it possible to do
> this in the MapReduce program or I need to tweak some Hadoop source files
> for that? Thanks.
>
> On Sun, Aug 19, 2012 at 7:02 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Hey Pavan,
>>
>> Yes you've got it almost right on how file.out is served to each
>> reducer. See the code at
>>
>> http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java?view=markup
>> (Method under L502:L565 that sends data for a specific
>> reduce/partition ID (integer)).
>>
>> On Sun, Aug 19, 2012 at 9:05 AM, Pavan Kulkarni <[EMAIL PROTECTED]>
>> wrote:
>>> Hi,
>>>
>>> I was trying to understand how exactly the reducers find out how to
>>> fetch the data of its own partition from Map nodes.
>>> During the executions of MapReduce, I see that *file.out* is created on
>>> Map nodes, so my question is how does a reducer
>>> know what part of file.out to fetch? Is the *file.out.index* play any
>>> role?
>>> Any help is appreciated .Thanks
>>>
>>>
>>>
>>> --With Regards
>>> Pavan Kulkarni
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
> --
>
> --With Regards
> Pavan Kulkarni

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
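
For readers following the thread, here is a rough sketch of the idea under discussion: one data file per map holding all partitions back to back, plus a small index that tells each reducer which byte range is its own. This is not Hadoop's code. The class PartitionIndexSketch, its methods, and the two-field (offset, length) record are all hypothetical, and the real file.out.index format may carry additional fields. The last two methods just restate the file-count arithmetic from the reply above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch only: hypothetical names, NOT Hadoop's classes or on-disk format.
final class PartitionIndexSketch {
    // Assumed record layout: one (offset, length) pair of longs per partition.
    static final int RECORD_BYTES = 2 * Long.BYTES;

    // Concatenate the per-partition segments into a single "file.out"-like blob.
    static byte[] concat(byte[][] segments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] segment : segments) {
            out.write(segment, 0, segment.length);
        }
        return out.toByteArray();
    }

    // Build the "file.out.index"-like blob: where each partition starts and how long it is.
    static byte[] buildIndex(byte[][] segments) {
        ByteBuffer index = ByteBuffer.allocate(segments.length * RECORD_BYTES);
        long offset = 0;
        for (byte[] segment : segments) {
            index.putLong(offset).putLong(segment.length);
            offset += segment.length;
        }
        return index.array();
    }

    // Serve one reducer: look up its record in the index, then copy only that byte range.
    static byte[] readPartition(byte[] data, byte[] index, int partition) {
        ByteBuffer idx = ByteBuffer.wrap(index);
        long offset = idx.getLong(partition * RECORD_BYTES);
        long length = idx.getLong(partition * RECORD_BYTES + Long.BYTES);
        return Arrays.copyOfRange(data, (int) offset, (int) (offset + length));
    }

    // One file per (map, reduce) pair, as proposed in the quoted question.
    static long filesWithOneFilePerReducer(long maps, long reduces) {
        return maps * reduces;
    }

    // One data file plus one index file per map.
    static long filesWithSingleFileAndIndex(long maps) {
        return 2 * maps;
    }
}
```

With the numbers from the reply, filesWithOneFilePerReducer(10_000, 1_000) gives 10,000,000 files, while filesWithSingleFileAndIndex(10_000) gives 20,000, which is the inode pressure the single-file layout avoids.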