Avro, mail # user - Secondary sort in hadoop with avro


RE: Secondary sort in hadoop with avro
Jacob Metcalf 2012-09-11, 22:09

Frank
I have spent a bit of time doing this recently, but with MR2 and CDH4, which may not match your use case. Assuming some similarities, though, I suspect your problem is that you also need to override compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) on AvroKeyComparator.
The advantage of Avro is that Hadoop does not need to deserialize the keys in order to sort them in the shuffle. This method, declared on RawComparator, lets Hadoop compare the serialized bytes directly, which is why an override that only takes deserialized objects may never be called.
Whilst this seems a bit daunting, my trick for doing this in MR2 is to leverage Avro's excellent support for projections, i.e. subsets of schemas. For example, let's say you want to "group" by attribute A but then "sort" by attribute B. In this case I would use a composite key with schema {A, B} and the out-of-the-box AvroKeyComparator as the sort comparator. Then I would implement my own grouping comparator, which uses a schema of just {A} and calls BinaryData.compare on the serialized keys, in the same way AvroKeyComparator does:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
I assume you can do something similar in MR1.
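
To make the sort-versus-group idea above concrete, here is a minimal, self-contained sketch. It is not Avro or Hadoop code: the class SecondarySortSketch, the fixed-width key encoding and the helper names are all hypothetical stand-ins so that the example runs with the JDK alone. In a real job the two comparators would delegate to Avro's BinaryData.compare with the {A, B} and {A} schemas respectively, and would be registered through the job's sort and grouping comparator settings.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class SecondarySortSketch {

    // ASSUMPTION: a made-up fixed-width encoding of the composite key {A, B}:
    // 4 bytes of big-endian A followed by 4 bytes of big-endian B.
    // Avro's real binary encoding is different; this only illustrates the idea.
    static byte[] encode(int a, int b) {
        return ByteBuffer.allocate(8).putInt(a).putInt(b).array();
    }

    // Reads one int straight out of the serialized key, without building an object.
    static int readInt(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf, offset, 4).getInt();
    }

    // Sort comparator, shaped like RawComparator.compare: orders by A, then by B.
    static int compareSort(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int byA = Integer.compare(readInt(b1, s1), readInt(b2, s2));
        if (byA != 0) {
            return byA;
        }
        return Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
    }

    // Grouping comparator: reads only the leading A field, i.e. the {A} "projection".
    static int compareGroup(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Sorts the serialized keys, then starts a new group whenever the grouping
    // comparator reports a change, roughly as the shuffle and reducer would.
    // Each key is returned decoded as {A, B} for easy inspection.
    static List<List<int[]>> sortAndGroup(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort((x, y) -> compareSort(x, 0, x.length, y, 0, y.length));
        List<List<int[]>> groups = new ArrayList<>();
        byte[] prev = null;
        for (byte[] k : sorted) {
            if (prev == null || compareGroup(prev, 0, prev.length, k, 0, k.length) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(new int[] {readInt(k, 0), readInt(k, 4)});
            prev = k;
        }
        return groups;
    }
}
```

Feeding in keys (2,5), (1,9), (2,1), (1,3) gives two groups, one per value of A, and within each group the keys arrive ordered by B. That is the secondary-sort effect: the sort comparator sees the whole key, the grouping comparator sees only its leading field.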
Regards
Jacob

> Subject: Secondary sort in hadoop with avro
> From: [EMAIL PROTECTED]
> Date: Tue, 11 Sep 2012 17:36:06 +0200
> To: [EMAIL PROTECTED]
>
> I need to implement secondary sort within an Avro-based MR sequence. However, I find little documentation or examples online.
> I would like to implement this by overriding the 'int compare(AvroWrapper<T> x, AvroWrapper<T> y)' method, but I cannot get it to be invoked.
> Does anybody have experience implementing secondary sort on deserialised Avro objects?
>
> Any help, advice or pointers would be very much appreciated!