Re: Mixed Avro/Hadoop Writable pipeline
Dan Filimon 2013-07-04, 06:21
Well, I got it working eventually. :)
First of all, I'll mention that I'm using the new MapReduce API, so no
AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
AvroValue<> wrappers. Once I set the right properties using AvroJob's
static methods (AvroJob.setMapOutputValueSchema(), for example) and set the
input format to AvroKeyInputFormat, everything worked out fine.
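
For anyone finding this later, the setup looked roughly like the sketch below.
It is a minimal outline rather than my exact code: the class name
MixedJobSetup, the job name, and the two schema parameters are placeholders,
and the mapper, reducer and output settings are left out. The key stays a
plain IntWritable and only the value goes through Avro.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper class; the names here are illustrative only.
public class MixedJobSetup {

    // inputSchema describes the records in the Avro input files,
    // valueSchema the Avro records emitted by the mapper (both assumed).
    public static Job configure(Configuration conf,
                                Schema inputSchema,
                                Schema valueSchema) throws IOException {
        Job job = Job.getInstance(conf, "mixed-avro-writable");

        // Read Avro container files; the mapper receives AvroKey<T> keys
        // and NullWritable values.
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, inputSchema);

        // Map output key stays a regular Hadoop writable.
        job.setMapOutputKeyClass(IntWritable.class);

        // Map output value is wrapped in AvroValue<>; this call also
        // registers Avro serialization in the job configuration.
        AvroJob.setMapOutputValueSchema(job, valueSchema);

        return job;
    }
}
```

Note that I only call the Avro setter for the side that is actually Avro.
Calling AvroJob.setMapOutputKeySchema() as well would switch the shuffle to
Avro's key comparator, which is the thing Martin warns about below.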
About the writables, I'm interested to know whether it'd be better to use
the Avro equivalents, e.g. AvroKey<Integer> instead of IntWritable. I assume
the speed and serialized size of the two should be the same (4 bytes)?
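
A rough way to check the size part of that question, using only the JDK: the
sketch below compares the fixed 4-byte layout of DataOutput.writeInt() (which
is what IntWritable writes) with the variable-length zig-zag encoding that
the Avro specification defines for int. The class and method names are made
up for illustration, and this computes encoded lengths by hand rather than
calling Avro itself. It says nothing about speed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for comparing encoded sizes; not part of Avro or Hadoop.
public class IntSizeSketch {

    // Bytes produced by DataOutput.writeInt, the call IntWritable.write uses.
    static int fixedSize(int n) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            new DataOutputStream(buf).writeInt(n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Bytes needed for a zig-zag, variable-length encoded int:
    // zig-zag maps small negative and positive values to small unsigned
    // values, which are then written 7 bits per byte.
    static int zigZagVarintSize(int n) {
        int z = (n << 1) ^ (n >> 31);
        int size = 1;
        while ((z & ~0x7F) != 0) {
            z >>>= 7;
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 63, 64, 1_000_000, Integer.MAX_VALUE};
        for (int n : samples) {
            System.out.println(n + ": fixed=" + fixedSize(n)
                    + " zigzag=" + zigZagVarintSize(n));
        }
    }
}
```

So the two are not always the same size: small magnitudes take 1 or 2 bytes
in the variable-length form, while values near the int limits take 5 bytes
versus the fixed 4.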
On Thu, Jul 4, 2013 at 2:48 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
> Hi Dan,
> You're stepping off the documented path here, but I think that although it
> might be a bit of work, it should be possible.
> Things to watch out for: you might not be able to use
> AvroMapper/AvroReducer so easily, and you may have to mess around with the
> job conf a bit (Avro-configured jobs use their own shuffle config with
> AvroKeyComparator, which may not be what you want if you're also trying to
> use writables). I'd suggest simply reading the code in
> org.apache.avro.mapred[uce] -- it's not too complicated.
> Whether Avro files or writables (i.e. Hadoop sequence files) are better
> for you depends mostly on which format you'd rather have your data in. If
> you want to read the data files with something other than Hadoop, Avro is
> definitely a good option. Also, Avro data files are self-describing (due to
> their embedded schema) which makes them pleasant to use with tools like Pig
> and Hive.
> On 3 July 2013 10:12, Dan Filimon <[EMAIL PROTECTED]> wrote:
>> I'm working on integrating Avro into our data processing pipeline.
>> We're using quite a few standard Hadoop and Mahout writables
>> (IntWritable, VectorWritable).
>> I'm first going to replace the custom Writables with Avro, but in terms
>> of the other ones, how important would you say it is to use
>> AvroKey<Integer> instead of IntWritable for example?
>> The changes will happen gradually but are they even worth it?