Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HDFS, mail # user - Re: secondary sort - number of reducers

Copy link to this message
Re: secondary sort - number of reducers
Ravi Kiran 2013-08-31, 12:04
   To add to Yong's  points
a)   Consider tuning the number of threads in reduce tasks and the task
tracker process.  mapred.reduce.parallel.copies
b)   See if the map output can be compressed to ensure there is less IO .
c)   Increase the io.sort.factor to ensure the framework merges a larger
number of files in each merge sort at the reducer
d)   Check the counter "Reduce Shuffle Bytes" of  each reducer to see any
skew of data at few reducers. Try for a even distribution of load through a
better partitioner code.

Ravi Magham
On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 <[EMAIL PROTECTED]>wrote:

> Well, The reducers normally will take much longer than the mappers stage,
> because the copy/shuffle/sort all happened at this time, and they are the
> hard part.
> But before we simply say it is part of life, you need to dig into more of
> your MR jobs to find out if you can make it faster.
> You are the person most familiar with your data, and you wrote the code to
> group/partition them, and send them to the reducers. Even you set up 255
> reducers, the question is, do each of them get its fair share?
> You need to read the COUNTER information of each reducer, and found out
> how many reducer groups each reducer gets, and how many input bytes it get,
> etc.
> Simple example, if you send 200G data, and group them by DATE, if all the
> data belongs to 2 days, and one of them contains 90% of data, then in this
> case, giving 255 reducers won't help, as only 2 reducers will consume data,
> and one of them will consume 90% of data, and will finish in a very long
> time, which WILL delay the whole MR job, while the rest reducers will
> finish within seconds. In this case, maybe you need to rethink what should
> be your key, and make sure each reducer get its fair share of volume of
> data.
> After the above fix (in fact, normally it will fix 90% of reducer
> performance problems, especially you have 255 reducer tasks available, so
> each one average will only get 1G data, good for your huge cluster only
> needs to process 256G data :-), if you want to make it even faster, then
> check you code. Do you have to use String.compareTo()? Is it slow?  Google
> hadoop rawcomparator to see if you can do something here.
> After that, if you still think the reducer stage slow, check you cluster
> system. Does the reducer spend most time on copy stage, or sort, or in your
> reducer class? Find out the where the time spends, then identify the
> solution.
> Yong
> ------------------------------
> Date: Fri, 30 Aug 2013 11:02:05 -0400
> Subject: Re: secondary sort - number of reducers
> my secondary sort on multiple keys seem to work fine with smaller data
> sets but with bigger data sets (like 256 gig and 800M+ records) the mapper
> phase gets done pretty quick (about 15 mins) but then the reducer phase
> seem to take forever. I am using 255 reducers.
> basic idea is that my composite key has both group and sort keys in it
> which i parse in the appropriate comparator classes to perform grouping and
> sorting .. my thinking is that mappers is where most of the work is done
> 1. mapper itself (create composite key and value)
> 2. recods sorting
> 3. partiotioner
> if all this gets done in 15 mins then reducer has the simple task of
> 1. grouping comparator
> 2. reducer itself (simply output records)
> should take less time than mappers .. instead it essentially gets stuck in
> reduce phase .. im gonna paste my code here to see if anything stands out
> as a fundamental design issue
> public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
> //extract the group key from composite key
>  String groupKey = key.toString().split("\\|")[0];
> return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>  }
> ////////////GROUP COMAPRATOR
> public int compare(WritableComparable a, WritableComparable b) {