Avro, mail # user - Mixed Avro/Hadoop Writable pipeline

Re: Mixed Avro/Hadoop Writable pipeline
Martin Kleppmann 2013-07-03, 23:48
Hi Dan,

You're stepping off the documented path here, but I think that although it
might be a bit of work, it should be possible.

Things to watch out for: you might not be able to use
AvroMapper/AvroReducer so easily, and you may have to adjust the job conf
by hand. Avro-configured jobs set up their own shuffle config, including
AvroKeyComparator as the sort comparator, which may not be what you want
if you're also using writables. I'd suggest simply reading the code in
org.apache.avro.mapred and org.apache.avro.mapreduce; it's not too
complicated.
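
As a rough illustration of the kind of job conf involved, here is a minimal
sketch of a job that reads Avro input but keeps Writables for the shuffle and
the output. It assumes the newer org.apache.avro.mapreduce API and Hadoop 2's
Job.getInstance; the class names, the "Event" schema, and its "label" field
are hypothetical. Because AvroJob.setMapOutputKeySchema is never called, the
map output key should be sorted with Text's own comparator rather than
AvroKeyComparator, but check that against the AvroJob source for your version.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MixedPipelineJob {

    // Hypothetical input schema: a record with a single string field.
    static final Schema EVENT_SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
        + "[{\"name\":\"label\",\"type\":\"string\"}]}");

    /** Avro records in, plain Writables out. */
    public static class LabelMapper
            extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text label = new Text();

        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context)
                throws IOException, InterruptedException {
            label.set(String.valueOf(key.datum().get("label")));
            context.write(label, ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mixed avro/writable");
        job.setJarByClass(MixedPipelineJob.class);

        // Input side: Avro data files, keyed by the record.
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, EVENT_SCHEMA);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Shuffle and output side: ordinary Writables in a sequence file.
        job.setMapperClass(LabelMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Converting one stage at a time this way keeps each job's conf explicit about
which side is Avro and which side is still Writable-based.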

Whether Avro files or writables (i.e. Hadoop sequence files) are better for
you depends mostly on which format you'd rather have your data in. If you
want to read the data files with something other than Hadoop, Avro is
definitely a good option. Also, Avro data files are self-describing (due to
their embedded schema) which makes them pleasant to use with tools like Pig
and Hive.
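
To make "self-describing" concrete, here is a small sketch that pulls the
embedded schema out of an Avro data file using only the JDK, with no Avro or
Hadoop on the classpath. It follows the object container file layout from the
Avro spec (4-byte magic, then a metadata map, then a sync marker). The class
and method names are made up for illustration; for real work the Avro
library's own file reader is the better choice.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class AvroHeaderPeek {
    // Avro container files start with the bytes 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Reads one zigzag-encoded, variable-length long. */
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zigzag
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" or "string"). */
    static byte[] readBytes(DataInputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > Integer.MAX_VALUE) throw new IOException("bad length " + len);
        byte[] buf = new byte[(int) len];
        in.readFully(buf);
        return buf;
    }

    /** Parses the header's metadata map; keys include avro.schema and avro.codec. */
    static Map<String, byte[]> readMetadata(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) throw new IOException("not an Avro data file");

        Map<String, byte[]> meta = new LinkedHashMap<>();
        // A map is a series of blocks; a block count of 0 ends the map.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {      // negative count: a byte size follows
                count = -count;
                readLong(in);     // skip the block's byte size
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                meta.put(key, readBytes(in));
            }
        }
        return meta;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] schema = readMetadata(in).get("avro.schema");
            System.out.println(schema == null
                ? "(no schema found)" : new String(schema, StandardCharsets.UTF_8));
        }
    }
}
```

Running it with the path to a local Avro data file as the only argument
prints the schema JSON that the writer stored in the file.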

On 3 July 2013 10:12, Dan Filimon <[EMAIL PROTECTED]> wrote:

> Hi!
> I'm working on integrating Avro into our data processing pipeline.
> We're using quite a few standard Hadoop and Mahout writables (IntWritable,
> VectorWritable).
> I'm first going to replace the custom Writables with Avro, but in terms of
> the other ones, how important would you say it is to use AvroKey<Integer>
> instead of IntWritable for example?
> The changes will happen gradually but are they even worth it?
> Thanks!