Re: WordPairCount Mapreduce question.
Harsh J 2013-02-25, 09:17
Also noteworthy is that the performance gain from the byte-level compare
method can only be had iff the serialized format of the data is comparable
at the byte level. One such provider is Apache Avro:
http://avro.apache.org/docs/current/spec.html#order. Most other
implementations simply deserialize again from the bytestream and then
compare, which has the higher (that is, the regular) cost.

On Mon, Feb 25, 2013 at 1:44 PM, Mahesh Balija <[EMAIL PROTECTED]> wrote:
> Byte array comparison is for performance reasons only, but NOT the way you
> are thinking.
> This method comes from an interface called RawComparator, which provides the
> prototype (public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
> int l2);) for this method.
> In the sorting phase where the keys are sorted, because of this
> implementation the records are read from the stream directly and sorted
> without the need to deserialize them into Objects.
>
> Best,
> Mahesh Balija,
> CalsoftLabs.
>
>
> On Sun, Feb 24, 2013 at 5:01 PM, Sai Sai <[EMAIL PROTECTED]> wrote:
>>
>> Thanks Mahesh for your help.
>>
>> Wondering if you can provide some insight into the below compare method
>> using byte[] in the SecondarySort example:
>>
>> public static class Comparator extends WritableComparator {
>>     public Comparator() {
>>         super(URICountKey.class);
>>     }
>>
>>     public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
>>             int l2) {
>>         return compareBytes(b1, s1, l1, b2, s2, l2);
>>     }
>> }
>>
>> My question is: in the compareTo method that I have given below we are
>> comparing word1/word2, which makes sense, but what about this byte[]
>> comparison? Is it right to assume that it converts each object's
>> word1/word2/word3 to byte[] and compares them?
>> If so, is it done for performance reasons?
>> Could you please verify.
>> Thanks
>> Sai
>> ________________________________
>> From: Mahesh Balija <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]; Sai Sai <[EMAIL PROTECTED]>
>> Sent: Saturday, 23 February 2013 5:23 AM
>> Subject: Re: WordPairCount Mapreduce question.
>>
>> Please check the in-line answers...
>>
>> On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hello
>>
>> I have a question about how MapReduce sorting works internally with
>> multiple columns.
>>
>> Below are my classes using 2 columns in an input file given below.
>>
>> 1st question: About the method hashCode, we are adding a "31 + ", I am
>> wondering why this is required. What does 31 refer to?
>>
>> This is how the hashcode is usually calculated for any String instance
>> (s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]), where n stands for the
>> length of the String. Since in your case you only have 2 fields, it will
>> be a * 31^0 + b * 31^1, where a and b are the hash codes of word1 and
>> word2.
>>
>>
>>
>> 2nd question: What if my input file has 3 columns instead of 2, how would
>> you write a compare method? I was wondering if anyone can map this to a
>> real world scenario, it will be really helpful.
>>
>> You will extend the same approach for the third column:
>> public int compareTo(WordPairCountKey o) {
>>     int diff = word1.compareTo(o.word1);
>>     if (diff == 0) {
>>         diff = word2.compareTo(o.word2);
>>         if (diff == 0) {
>>             diff = word3.compareTo(o.word3);
>>         }
>>     }
>>     return diff;
>> }
>>
>>
>>
>>
>> @Override
>> public int compareTo(WordPairCountKey o) {
>>     int diff = word1.compareTo(o.word1);
>>     if (diff == 0) {
>>         diff = word2.compareTo(o.word2);
>>     }
>>     return diff;
>> }
>>
>> @Override
>> public int hashCode() {
>>     return word1.hashCode() + 31 * word2.hashCode();
>> }
>>
>> ******************************
>>
>> Here is my input file wordpair.txt
>>
>> ******************************
>>
>> a b
>> a c
>> a b
>> a d
>> b d
>> e f
>> b d
>> e f
>> b d

Harsh J
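
To make the point about byte-level comparability at the top of this thread concrete, here is a minimal sketch in plain Java with no Hadoop classes. Everything in it is an assumption made for illustration: the class RawCompareSketch, the serialize helper, and the length-prefixed layout produced by DataOutputStream.writeUTF. It is not the actual wire format of URICountKey or WordPairCountKey, which the thread does not show. The compareBytes method is a hand-written stand-in with the same parameter shape as the RawComparator prototype quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RawCompareSketch {

    // Hypothetical serialized key: two strings, each written with writeUTF
    // (a 2-byte length prefix followed by the encoded characters).
    static byte[] serialize(String word1, String word2) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(word1);
            out.writeUTF(word2);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lexicographic comparison of unsigned bytes; a stand-in for the
    // byte-level compare discussed in the thread.
    static int compareBytes(byte[] b1, int s1, int l1,
                            byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Byte-level path: serialize both keys, compare the raw bytes.
    static int rawCompare(String a1, String a2, String b1, String b2) {
        byte[] x = serialize(a1, a2);
        byte[] y = serialize(b1, b2);
        return compareBytes(x, 0, x.length, y, 0, y.length);
    }

    // Object-level path: compare field by field, as compareTo does.
    static int compareFields(String a1, String a2, String b1, String b2) {
        int diff = a1.compareTo(b1);
        return diff != 0 ? diff : a2.compareTo(b2);
    }

    public static void main(String[] args) {
        // Equal-length words: both orderings agree.
        System.out.println(Integer.signum(rawCompare("a", "b", "a", "c")));
        System.out.println(Integer.signum(compareFields("a", "b", "a", "c")));
        // Different-length words: the length prefix is compared first,
        // so the byte order no longer matches the field order.
        System.out.println(Integer.signum(rawCompare("b", "d", "aa", "d")));
        System.out.println(Integer.signum(compareFields("b", "d", "aa", "d")));
    }
}
```

With this assumed layout the two orderings agree for the single-letter words in wordpair.txt, but they disagree for ("b", "d") versus ("aa", "d"): the byte-level compare sees the shorter length prefix first and puts "b" ahead, while the field-level compare puts "aa" ahead. The byte-level shortcut therefore only yields the same sort order as compareTo when the serialized form is designed to be order-preserving, which is what the Avro link above describes.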