Re: Mixed Avro/Hadoop Writable pipeline
Pradeep Gollakota 2013-07-04, 10:02
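AvroKey<Integer> is not a fixed 4 bytes: Avro writes an int as a zig-zag
varint, so it takes 1 to 5 bytes depending on magnitude. IntWritable is
always exactly 4 bytes. The variable-length writable is VIntWritable, which
uses fewer bytes when the number can be represented in less than 4.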
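
To make the size comparison concrete, here is a minimal, stdlib-only sketch
(not the Avro or Hadoop code itself). It assumes the int encoding described
in the Avro specification (zig-zag, then base-128 varint) and compares it
with DataOutputStream.writeInt(), which is what IntWritable.write() delegates
to. The class and method names are hypothetical, chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class IntEncodingSizes {

    // Zig-zag maps ints of small magnitude (positive or negative)
    // to small unsigned values: 0->0, -1->1, 1->2, -2->3, ...
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Bytes needed for a zig-zag varint: 7 payload bits per byte.
    static int zigZagVarintSize(int n) {
        int v = zigZag(n);
        int size = 1;
        while ((v & ~0x7F) != 0) {
            v >>>= 7;
            size++;
        }
        return size;
    }

    // Bytes written by a fixed-width big-endian int (the IntWritable layout).
    static int fixedIntSize(int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 63, 64, 1_000_000, Integer.MAX_VALUE};
        for (int n : samples) {
            System.out.println(n + ": varint=" + zigZagVarintSize(n)
                    + " bytes, fixed=" + fixedIntSize(n) + " bytes");
        }
    }
}
```

Under these assumptions, values in [-64, 63] fit in a single byte, while
values near Integer.MAX_VALUE or Integer.MIN_VALUE need 5, one more than the
fixed-width form. Which is smaller overall depends on the distribution of
your keys.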
On Jul 4, 2013 2:22 AM, "Dan Filimon" <[EMAIL PROTECTED]> wrote:
> Well, I got it working eventually. :)
> First of all, I'll mention that I'm using the new MapReduce API, so no
> AvroMapper/AvroReducer voodoo for me. I'm just using the AvroKey<> and
> AvroValue<> wrappers. Once I set the right properties using AvroJob's
> static methods (AvroJob.setMapOutputValueSchema() for example) and set the
> input format to AvroKeyInputFormat, everything worked out fine.
> About the writables, I'm interested to know which is better to use: the
> Avro equivalent (AvroKey<Integer>) or IntWritable. I assume the speed and
> serialized size of the two should be the same, i.e. 4 bytes?
> On Thu, Jul 4, 2013 at 2:48 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
>> Hi Dan,
>> You're stepping off the documented path here, but I think that although
>> it might be a bit of work, it should be possible.
>> Things to watch out for: you might not be able to use
>> AvroMapper/AvroReducer so easily, and you may have to mess around with the
>> job conf a bit (Avro-configured jobs use their own shuffle config with
>> AvroKeyComparator, which may not be what you want if you're also trying to
>> use writables). I'd suggest simply reading the code in
>> org.apache.avro.mapred[uce] -- it's not too complicated.
>> Whether Avro files or writables (i.e. Hadoop sequence files) are better
>> for you depends mostly on which format you'd rather have your data in. If
>> you want to read the data files with something other than Hadoop, Avro is
>> definitely a good option. Also, Avro data files are self-describing (due to
>> their embedded schema) which makes them pleasant to use with tools like Pig
>> and Hive.
>> On 3 July 2013 10:12, Dan Filimon <[EMAIL PROTECTED]> wrote:
>>> I'm working on integrating Avro into our data processing pipeline.
>>> We're using quite a few standard Hadoop and Mahout writables
>>> (IntWritable, VectorWritable).
>>> I'm first going to replace the custom Writables with Avro, but in terms
>>> of the other ones, how important would you say it is to use
>>> AvroKey<Integer> instead of IntWritable for example?
>>> The changes will happen gradually but are they even worth it?