I ran into an issue with grouping comparators today when using the Avro
"new" mapreduce API. (I'll say "old" to refer to mapred.* and "new" to
refer to mapreduce.* and hope this is clear to everybody!)
Hadoop applies the grouping comparator differently between the old and new
APIs. In old API jobs, the comparator is called through the object-level
interface, compare(x, y), whereas in new API jobs it is called through the
RawComparator interface, compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2).
In standard mapreduce this isn't an issue, because if you haven't provided
a custom implementation of RawComparator, Hadoop handles it for you behind
the scenes by deserializing the Writables and calling your compare(x, y).
Avro doesn't do this; it falls back to a default binary comparison instead
(and of course I'm not saying this is wrong, it's just different).
What this means is that if you want to use a grouping comparator with the
"new" mapreduce API and Avro, you absolutely must provide an implementation
of RawComparator, or do the deserialization and the delegating call to
compare(x, y) yourself. This really isn't obvious, and I haven't found it
documented anywhere.
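To make the failure mode concrete, here's a minimal, self-contained sketch of the deserialize-and-delegate pattern. Everything here is an illustrative assumption: the key layout (two big-endian ints, an id followed by a timestamp, packed with ByteBuffer rather than Avro's BinaryDecoder) and the class/method names are hypothetical, not Avro's API. The point is that a grouping comparator usually compares only part of the key, so the raw-bytes fallback (which compares all the bytes) groups differently than your compare(x, y) would.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical example: keys are (id, timestamp), and we want to group
// reduce values by id only. A real Avro job would decode the bytes with
// the key schema; ByteBuffer stands in for that here.
public class GroupById {

    // Object-level grouping comparison: id only, timestamp ignored.
    static int compare(int idX, int idY) {
        return Integer.compare(idX, idY);
    }

    // Raw interface, as the new API invokes it:
    // compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2).
    // Deserialize each side, then delegate to compare(x, y).
    static int compareRaw(byte[] b1, int s1, int l1,
                          byte[] b2, int s2, int l2) {
        int idX = ByteBuffer.wrap(b1, s1, l1).getInt(); // first field only
        int idY = ByteBuffer.wrap(b2, s2, l2).getInt();
        return compare(idX, idY);
    }

    // Serialize a key as two big-endian ints: id, then timestamp.
    static byte[] key(int id, int ts) {
        return ByteBuffer.allocate(8).putInt(id).putInt(ts).array();
    }

    public static void main(String[] args) {
        byte[] k1 = key(1, 5);
        byte[] k2 = key(1, 9);
        // The delegating comparator groups them: same id.
        System.out.println(compareRaw(k1, 0, 8, k2, 0, 8)); // 0
        // A plain byte-by-byte comparison does not, because the
        // trailing timestamp bytes differ.
        System.out.println(Arrays.compare(k1, k2) == 0); // false
    }
}
```

If grouping silently falls back to the binary comparison, the two keys above land in separate reduce groups, which is exactly the "job runs, results are wrong" symptom described below.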
So my question, I suppose, is: what, if anything, can be done about this?
Diagnosing problems with reduce value grouping is, in my opinion, one of
the really tricky parts of Hadoop development, as no amount of unit testing
will help you and the jobs appear to work - it's just that the results are
very often incorrect.