Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> secondary sort - number of reducers

Copy link to this message
RE: secondary sort - number of reducers
Well, The reducers normally will take much longer than the mappers stage, because the copy/shuffle/sort all happened at this time, and they are the hard part.
But before we simply say it is part of life, you need to dig into more of your MR jobs to find out if you can make it faster.
You are the person most familiar with your data, and you wrote the code to group/partition them, and send them to the reducers. Even you set up 255 reducers, the question is, do each of them get its fair share?You need to read the COUNTER information of each reducer, and found out how many reducer groups each reducer gets, and how many input bytes it get, etc.
Simple example, if you send 200G data, and group them by DATE, if all the data belongs to 2 days, and one of them contains 90% of data, then in this case, giving 255 reducers won't help, as only 2 reducers will consume data, and one of them will consume 90% of data, and will finish in a very long time, which WILL delay the whole MR job, while the rest reducers will finish within seconds. In this case, maybe you need to rethink what should be your key, and make sure each reducer get its fair share of volume of data.
After the above fix (in fact, normally it will fix 90% of reducer performance problems, especially you have 255 reducer tasks available, so each one average will only get 1G data, good for your huge cluster only needs to process 256G data :-), if you want to make it even faster, then check you code. Do you have to use String.compareTo()? Is it slow?  Google hadoop rawcomparator to see if you can do something here.
After that, if you still think the reducer stage slow, check you cluster system. Does the reducer spend most time on copy stage, or sort, or in your reducer class? Find out the where the time spends, then identify the solution.

Date: Fri, 30 Aug 2013 11:02:05 -0400
Subject: Re: secondary sort - number of reducers

my secondary sort on multiple keys seem to work fine with smaller data sets but with bigger data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about 15 mins) but then the reducer phase seem to take forever. I am using 255 reducers.

basic idea is that my composite key has both group and sort keys in it which i parse in the appropriate comparator classes to perform grouping and sorting .. my thinking is that mappers is where most of the work is done
1. mapper itself (create composite key and value)2. recods sorting3. partiotioner
if all this gets done in 15 mins then reducer has the simple task of1. grouping comparator
2. reducer itself (simply output records)
should take less time than mappers .. instead it essentially gets stuck in reduce phase .. im gonna paste my code here to see if anything stands out as a fundamental design issue

//////PARTITIONERpublic int getPartition(Text key, HCatRecord record, int numReduceTasks) { //extract the group key from composite key
String groupKey = key.toString().split("\\|")[0]; return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

////////////GROUP COMAPRATORpublic int compare(WritableComparable a, WritableComparable b) { //compare to text objects
String thisGroupKey = ((Text) a).toString().split("\\|")[0]; String otherGroupKey = ((Text) b).toString().split("\\|")[0];
//extract return thisGroupKey.compareTo(otherGroupKey); }
////////////SORT COMPARATOR is similar to group comparator and is in map phase and gets done quick
public void reduce(Text key, Iterable<HCatRecord> records, Context context) throws IOException, InterruptedException { log.info("in reducer for key " + key.toString());
Iterator<HCatRecord> recordsIter = records.iterator(); //we are only interested in the first record after sorting and grouping
if(recordsIter.hasNext()){ HCatRecord rec = recordsIter.next(); context.write(nw, rec);
log.info("returned record >> " + rec.toString()); } }
On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi <[EMAIL PROTECTED]> wrote:

yup it was negative and by doing this now it seems to be working fine
On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma <[EMAIL PROTECTED]> wrote:
Is the hash code of that key  is negative.?

Do something like this

return groupKey.hashCode() & Integer.MAX_VALUE % numParts;


Som Shekhar Sharma


On Fri, Aug 30, 2013 at 6:25 AM, Adeel Qureshi <[EMAIL PROTECTED]> wrote: