

Sai Sai 2013-02-23, 12:52
Mahesh Balija 2013-02-23, 13:23
Mahesh Balija 2013-02-25, 08:14
Re: WordPairCount Mapreduce question.
Also noteworthy: the performance gain from the byte-level compare method
can only be had if the serialized format of the data is comparable at the
byte level, i.e. if comparing the raw bytes yields the same order as
comparing the deserialized objects. One such provider is Apache Avro:
http://avro.apache.org/docs/current/spec.html#order.

Most other implementations simply deserialize again from the
bytestream and then compare, which carries the regular (higher) cost of
deserialization.
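
To make the "comparable at the byte level" condition concrete, here is a
minimal sketch in plain Java with no Hadoop dependency. The class name, the
encode/rawCompare helpers, and the hand-written compareBytes (a lexicographic
comparison of unsigned bytes) are all made up for this illustration and are
not taken from the thread. It only covers non-negative ints: a big-endian
encoding sorts correctly on raw bytes, a little-endian encoding of the same
values does not.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    // Hand-written for this sketch; it is an assumption, not a library call.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }

    // Serializes an int into 4 bytes using the given byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Compares two ints by looking only at their serialized bytes.
    static int rawCompare(int x, int y, ByteOrder order) {
        byte[] a = encode(x, order);
        byte[] b = encode(y, order);
        return compareBytes(a, 0, a.length, b, 0, b.length);
    }

    public static void main(String[] args) {
        // Numerically 1 < 256, so a correct comparison must be negative.
        System.out.println(Integer.signum(rawCompare(1, 256, ByteOrder.BIG_ENDIAN)));    // -1
        System.out.println(Integer.signum(rawCompare(1, 256, ByteOrder.LITTLE_ENDIAN))); // 1
    }
}
```

With the little-endian format the raw comparison gives the wrong answer, so a
comparator for such a format would have to deserialize before comparing.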

On Mon, Feb 25, 2013 at 1:44 PM, Mahesh Balija
<[EMAIL PROTECTED]> wrote:
> The byte array comparison is there for performance reasons only, but NOT
> in the way you are thinking.
> This method comes from an interface called RawComparator, which declares the
> prototype (public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2,
> int l2);) for this method.
> In the sorting phase, where the keys are sorted, this implementation lets
> the records be read directly from the stream and sorted
> without the need to deserialize them into objects.
>
> Best,
> Mahesh Balija,
> CalsoftLabs.
>
>
> On Sun, Feb 24, 2013 at 5:01 PM, Sai Sai <[EMAIL PROTECTED]> wrote:
>>
>> Thanks Mahesh for your help.
>>
>> I am wondering if you can provide some insight into the compare method
>> below, which uses byte[], from the SecondarySort example:
>>
>> public static class Comparator extends WritableComparator {
>>     public Comparator() {
>>         super(URICountKey.class);
>>     }
>>
>>     @Override
>>     public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
>>         // Compare the serialized keys byte by byte, without deserializing.
>>         return compareBytes(b1, s1, l1, b2, s2, l2);
>>     }
>> }
>>
>> My question: in the compare method I have given below, we are
>> comparing word1/word2, which makes sense. But what about this byte[]
>> comparison? Is it right to assume that it converts each object's
>> word1/word2/word3 to byte[] and compares them?
>> If so, is it done for performance reasons?
>> Could you please verify?
>> Thanks
>> Sai
>> ________________________________
>> From: Mahesh Balija <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]; Sai Sai <[EMAIL PROTECTED]>
>> Sent: Saturday, 23 February 2013 5:23 AM
>> Subject: Re: WordPairCount Mapreduce question.
>>
>> Please check the in-line answers...
>>
>> On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hello
>>
>> I have a question about how Mapreduce sorting works internally with
>> multiple columns.
>>
>> Below are my classes, which use the 2 columns of the input file given below.
>>
>> 1st question: About the method hashCode, we are adding a "31 * "; I am
>> wondering why this is required. What does 31 refer to?
>>
>> This is how the hash code is usually calculated for any String instance:
>> s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where n stands for the length
>> of the String and 31 is the multiplier. Since in your case you have only 2
>> fields, with a = word1.hashCode() and b = word2.hashCode(), it will be
>> a * 31^0 + b * 31^1.
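
The formula quoted above can be checked with a small sketch. The class and
method names here are invented for illustration, and the words are plain
Java Strings rather than the key's actual field types, which the thread does
not show. It recomputes String.hashCode() by hand and then applies the same
word1.hashCode() + 31 * word2.hashCode() combination to two orderings of the
same words.

```java
class HashCodeDemo {
    // Recomputes String.hashCode() by hand:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated incrementally.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Same combination as the hashCode() posted in the thread.
    static int pairHash(String word1, String word2) {
        return word1.hashCode() + 31 * word2.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("ab") == "ab".hashCode()); // true
        System.out.println(pairHash("a", "b")); // 97 + 31 * 98 = 3135
        System.out.println(pairHash("b", "a")); // 98 + 31 * 97 = 3105
    }
}
```

Without the multiplier, a plain sum would give the pairs (a, b) and (b, a)
the same hash code; with it, the position of each word matters.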
>>
>>
>>
>> 2nd question: what if my input file has 3 columns instead of 2? How would
>> you write a compare method? If anyone can map this to a
>> real-world scenario, it would be really helpful.
>>
>> You would extend the same approach to the third column:
>>
>>     public int compareTo(WordPairCountKey o) {
>>         int diff = word1.compareTo(o.word1);
>>         if (diff == 0) {
>>             diff = word2.compareTo(o.word2);
>>             if (diff == 0) {
>>                 diff = word3.compareTo(o.word3);
>>             }
>>         }
>>         return diff;
>>     }
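
As a rough, self-contained illustration of the three-column comparison, the
sketch below sorts a few made-up rows. WordTripleKey is a hypothetical
stand-in with plain String fields; a real Hadoop key would also need the
Writable serialization methods, which are left out here, and the sample rows
are invented rather than taken from wordpair.txt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ThreeColumnSortDemo {
    // Hypothetical three-field key, ordered by word1, then word2, then word3.
    static final class WordTripleKey implements Comparable<WordTripleKey> {
        final String word1;
        final String word2;
        final String word3;

        WordTripleKey(String word1, String word2, String word3) {
            this.word1 = word1;
            this.word2 = word2;
            this.word3 = word3;
        }

        @Override
        public int compareTo(WordTripleKey o) {
            // A later column is consulted only when all earlier columns tie.
            int diff = word1.compareTo(o.word1);
            if (diff == 0) {
                diff = word2.compareTo(o.word2);
            }
            if (diff == 0) {
                diff = word3.compareTo(o.word3);
            }
            return diff;
        }

        @Override
        public String toString() {
            return word1 + " " + word2 + " " + word3;
        }
    }

    // Sorts a few sample keys and returns their string forms in sorted order.
    static List<String> sortedKeys() {
        List<WordTripleKey> keys = new ArrayList<>();
        keys.add(new WordTripleKey("b", "d", "x"));
        keys.add(new WordTripleKey("a", "c", "z"));
        keys.add(new WordTripleKey("a", "b", "y"));
        keys.add(new WordTripleKey("a", "b", "x"));
        Collections.sort(keys);

        List<String> out = new ArrayList<>();
        for (WordTripleKey key : keys) {
            out.add(key.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedKeys()); // [a b x, a b y, a c z, b d x]
    }
}
```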
>>
>>
>>
>>
>>     @Override
>>     public int compareTo(WordPairCountKey o) {
>>         int diff = word1.compareTo(o.word1);
>>         if (diff == 0) {
>>             diff = word2.compareTo(o.word2);
>>         }
>>         return diff;
>>     }
>>
>>     @Override
>>     public int hashCode() {
>>         return word1.hashCode() + 31 * word2.hashCode();
>>     }
>>
>> ******************************
>>
>> Here is my input file wordpair.txt
>>
>> ******************************
>>
>> a    b
>> a    c
>> a    b
>> a    d
>> b    d
>> e    f
>> b    d
>> e    f
>> b    d
>
Harsh J