Avro >> mail # user >> Secondary sort in hadoop with avro


Frank Kootte 2012-09-11, 15:36
Jacob Metcalf 2012-09-11, 22:09
Frank Kootte 2012-09-12, 06:51
Frank Kootte 2012-09-12, 14:42
Re: Secondary sort in hadoop with avro
I suspect the best way would be to work out how to apply the techniques to MR1.

However, for MR2 support look at AVRO-593 and odiago-avro on GitHub. Garrett Wu has written a series of extensions that support the use of Avro in the shuffle. These have been integrated into Avro as of 1.7.

Jacob

-----Original Message-----

From: Frank Kootte
Sent: 12 Sep 2012 14:42:29 GMT
To: [EMAIL PROTECTED]
Subject: Re: Secondary sort in hadoop with avro

I would like to use MR2 in conjunction with Avro but cannot find much
documentation on the topic. Do you have any pointers in that direction?
Avro 1.7.1 does not have any AvroReducer / Mapper in the mapreduce package.
I didn't look into it enough to see whether compatibility with the v2 API
is now handled transparently under the hood.
In short, I am having tremendous trouble finding documentation on the topic.
Hopefully you are able to help me along.
2012/9/12 Frank Kootte <[EMAIL PROTECTED]>

> Very interesting concept you mention there: Avro projections!
> This does sound like a clever way to leverage Avro's ability to
> compare without deserialisation, which will obviously be beneficial.
> As with a lot of Avro-related Hadoop topics I am not able to find a
> clear example, but based on what I did manage to find I would like your
> feedback on my questions:
>
> Does avro projection involve defining a secondary schema describing only
> the desired subset of fields ?
> Does this then imply that when I define my own AvroKeyComparator<A> the
> byte arrays will only contain the data for set A ?
> How should the BinaryCompare be used differently from the base impl
> in AvroKeyComparator ?
>
> Secondly, I've tried to implement a custom AvroKeyComparator, and
> specifically the compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int
> l2) method.
> I am woefully unaware of how exactly to do this and cannot find many
> examples on the topic.
>
> Could you write me a small sample of pseudo code perhaps ?
> Or point me to some documentation to get me on my way ?
>
>
> 2012/9/12 Jacob Metcalf <[EMAIL PROTECTED]>
>
>>  Frank
>>
>> I have spent a bit of time doing this recently but with MR2 and CDH4
>> which may not be appropriate to your use case. However assuming some
>> similarities, I suspect your problem is that you also need to override compare(byte[]
>> b1, int s1, int l1, byte[] b2, int s2, int l2) on AvroKeyComparator.
>>
>> The advantage of Avro is that Hadoop does not need to deserialize to sort
>> in the shuffle. This function in RawComparator allows Hadoop to quickly
>> compare the bytes directly.
>>
>> Whilst this seems a bit daunting, my trick for doing this in MR2 is to
>> leverage Avro's excellent support for projections (subsets of schemas). For
>> example, let's say you want to "group" by attribute A but then "sort" by
>> attribute B. In this case I would use a composite key with schema {A, B}
>> and the out-of-the-box AvroKeyComparator as the sort comparator. Then I
>> would implement my own grouping comparator which uses a schema of just {A}
>> and then uses the BinaryData function to compare:
>>
>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>>
>> I assume you can do something similar in MR1.
>>
>> Regards
>>
>> Jacob
>>
>> > Subject: Secondary sort in hadoop with avro
>> > From: [EMAIL PROTECTED]
>> > Date: Tue, 11 Sep 2012 17:36:06 +0200
>> > To: [EMAIL PROTECTED]
>>
>> >
>> > I need to implement secondary sort within an Avro-based MR sequence.
>> > However, I find little to no documentation or examples online.
>> > I would like to implement this by overriding the 'int
>> > compare(AvroWrapper<T> x, AvroWrapper<T> y)' method but I fail to have it
>> > invoked.
>> > Does anybody have experience implementing secondary sort on
>> > deserialised Avro objects?
>> >
>> > Some help, advice or pointers will be very much appreciated!
>>
>
>
>
> --
> Mvrgr. Frank
>

--
Mvrgr. Frank
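
The grouping trick Jacob describes in the quoted message (sort on the composite key {A, B}, group on the projection {A}, both by reading the serialized bytes) can be illustrated with a small stand-alone sketch. This is not Avro or Hadoop code: it is a plain-Java stand-in that assumes both A and B are longs written in Avro's zig-zag variable-length encoding, and the names RawKeySketch, compareRaw, sortCompare, groupCompare and shuffle are hypothetical. In a real job the byte-level comparison would presumably be delegated to Avro's BinaryData.compare with the {A, B} or {A} schema, and the two comparators registered through Job.setSortComparatorClass and Job.setGroupingComparatorClass.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative stand-in for comparing Avro-style binary keys without building objects. */
final class RawKeySketch {

    /** Appends v using zig-zag variable-length encoding (the scheme Avro uses for longs). */
    static void writeLong(ByteArrayOutputStream out, long v) {
        long z = (v << 1) ^ (v >> 63);            // zig-zag: small magnitudes use few bytes
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** Serializes the composite key {A, B} as two consecutive longs (assumed layout). */
    static byte[] encodeKey(long a, long b) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, a);
        writeLong(out, b);
        return out.toByteArray();
    }

    /** Reads one long starting at pos[0] and advances pos[0] past it. */
    static long readLong(byte[] buf, int[] pos) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos[0]++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);              // undo zig-zag
    }

    /** Compares only the first `fields` longs of two serialized keys. */
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2, int fields) {
        int[] p1 = {s1};
        int[] p2 = {s2};
        for (int i = 0; i < fields; i++) {
            int c = Long.compare(readLong(b1, p1), readLong(b2, p2));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    /** Sort comparator: full key {A, B}. */
    static int sortCompare(byte[] x, byte[] y) {
        return compareRaw(x, 0, y, 0, 2);
    }

    /** Grouping comparator: projection {A} only; the bytes of B are never read. */
    static int groupCompare(byte[] x, byte[] y) {
        return compareRaw(x, 0, y, 0, 1);
    }

    /** Sorts by {A, B}, then collects the B values of adjacent keys that share A. */
    static List<List<Long>> shuffle(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort(RawKeySketch::sortCompare);
        List<List<Long>> groups = new ArrayList<>();
        byte[] prev = null;
        for (byte[] k : sorted) {
            if (prev == null || groupCompare(prev, k) != 0) {
                groups.add(new ArrayList<>());    // A changed: a new reduce group starts
            }
            int[] pos = {0};
            readLong(k, pos);                     // skip A
            groups.get(groups.size() - 1).add(readLong(k, pos));
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
            encodeKey(2, 30), encodeKey(1, 20), encodeKey(2, -5), encodeKey(1, 10));
        System.out.println(shuffle(keys));        // prints [[10, 20], [-5, 30]]
    }
}
```

The projection works here only because A is the first field of the key, so a comparator that knows just {A} can stop reading before it reaches B. Each group then arrives with its B values already in sorted order, which is the secondary-sort effect asked about in the original question.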