Avro, mail # dev - Questions re integrating Avro into Cascading process


Questions re integrating Avro into Cascading process
Ken Krugler 2010-04-18, 14:49
Hi all,

> We're looking at creating a Cascading Scheme for Avro, and have got  
> a few questions below. These are very general, as this is more of a  
> scoping phase (as in, are we crazy to try this) so apologies in  
> advance for lack of detail.
>
> For context, Cascading is an open source project that provides a  
> workflow API on top of Hadoop. The key unit of data is a tuple,  
> which corresponds to a record - you have fields (names) and values.  
> Cascading uses a generalized "tap" concept for reading & writing  
> tuples, where a tap uses a scheme to handle the low-level mapping  
> from Cascading-land to/from the storage format.
>
> So the goal here is to define a Cascading Scheme that will run on  
> 0.18.3 and later versions of Hadoop, and provide general support for  
> reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
>
> We grabbed the recently committed AvroXXX code from  
> org.apache.avro.mapred (thanks Doug & Scott), and began building the  
> Cascading scheme to bridge between AvroWrapper<T> keys and Cascading  
> tuples.

An update on status - there's a working Cascading tap at
http://github.com/bixolabs/cascading.avro. See the README
(http://github.com/bixolabs/cascading.avro/blob/master/README) for
more details.

One open issue - it would be great to be able to set metadata in the  
headers of the resulting Avro files. But it wasn't obvious how to do  
that, given our (intentionally) arm's-length approach via the use of  
the Avro mapred code.

One idea would be to have job conf values using keys prefixed with  
avro.metadata.xxx, and the Avro mapred support could automagically use  
that when creating the file. But this would break our goal of using  
unmodified Avro source, so I'm curious whether support for setting the  
file metadata would also be useful for the standard (Hadoop) use of  
Avro for an output format, and if so, whether there was a better  
approach.
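
To make the prefixed-key idea concrete, here is a rough sketch of the
lookup step only. Everything in it is an assumption for illustration:
the class name, the method name, and the "avro.metadata." prefix are
hypothetical, and a plain Map stands in for the Hadoop job conf so the
snippet runs without Hadoop or Avro on the classpath.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Avro or cascading.avro.
final class AvroMetadataConf {
    // Assumed prefix, following the "avro.metadata.xxx" idea above.
    static final String PREFIX = "avro.metadata.";

    // Collects every conf entry whose key starts with PREFIX and returns
    // it with the prefix stripped, sorted by key. A bare "avro.metadata."
    // key with nothing after the prefix is ignored.
    static Map<String, String> extractMetadata(Map<String, String> jobConf) {
        Map<String, String> meta = new TreeMap<>();
        for (Map.Entry<String, String> entry : jobConf.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                meta.put(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return meta;
    }
}
```

In a real output format, each returned pair would presumably be handed
to the Avro file writer (DataFileWriter has a setMeta(String, String)
method) before the first record is appended. Where that call would
live inside the mapred code is exactly the open question.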

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g