Questions re integrating Avro into Cascading process
Ken Krugler 2010-04-15, 17:33
Hi all,

We're looking at creating a Cascading Scheme for Avro, and have got a  
few questions below. These are very general, as this is more of a  
scoping phase (as in, are we crazy to try this) so apologies in  
advance for lack of detail.

For context, Cascading is an open source project that provides a  
workflow API on top of Hadoop. The key unit of data is a tuple, which  
corresponds to a record - you have fields (names) and values.  
Cascading uses a generalized "tap" concept for reading & writing  
tuples, where a tap uses a scheme to handle the low-level mapping from  
Cascading-land to/from the storage format.

So the goal here is to define a Cascading Scheme that will run on  
0.18.3 and later versions of Hadoop, and provide general support for  
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.

We grabbed the recently committed AvroXXX code from  
org.apache.avro.mapred (thanks Doug & Scott), and began building the  
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
tuples.

1. What's the best approach if we want to dynamically define the Avro  
schema, based on a list of field names and types (classes)?

This assumes it's possible to dynamically define & use a schema, of
course.
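
To make question 1 concrete, here is a minimal sketch of one possible
approach, not a tested implementation. Avro schemas are declared as
JSON, so the sketch builds the JSON text for a record schema from an
ordered list of field names and classes, using only the JDK. The class
name SchemaSketch, the helper names, and the class-to-type mapping are
hypothetical. The resulting string would then be handed to Avro's
schema parser (Schema.parse(String), assuming that is the right entry
point in the Avro version in use).

```java
import java.util.Map;

class SchemaSketch {
    // Hypothetical mapping from Java classes to Avro primitive type names.
    static String avroType(Class<?> c) {
        if (c == String.class) return "string";
        if (c == Integer.class) return "int";
        if (c == Long.class) return "long";
        if (c == Float.class) return "float";
        if (c == Double.class) return "double";
        if (c == Boolean.class) return "boolean";
        throw new IllegalArgumentException("No Avro type for " + c.getName());
    }

    // Builds the JSON declaration of an Avro record schema.
    // The map must preserve insertion order (e.g. LinkedHashMap) so that
    // the schema fields follow the tuple's field order.
    static String recordSchemaJson(String recordName, Map<String, Class<?>> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"record\",\"name\":\"").append(recordName)
          .append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, Class<?>> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append("{\"name\":\"").append(e.getKey())
              .append("\",\"type\":\"").append(avroType(e.getValue()))
              .append("\"}");
        }
        return sb.append("]}").toString();
    }
}
```

The sketch ignores null values; fields that can be null would
presumably need a union type such as ["null", "string"] instead of a
bare primitive, and field names would need to be valid Avro names.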

2. How much has the new Hadoop map-reduce support code been tested?

3. Will there be issues with running in 0.18.3, 0.19.2, etc?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,  
and that then creating problems. Anything else?

4. The key integration point, besides the fields+classes to schema  
issue above, is mapping between Cascading tuples and AvroWrapper<T>.

If we're using (I assume) the generic format, any input on how we'd do  
this two-way conversion?
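
As a rough illustration of the conversion being asked about, the
sketch below pairs tuple values with field names by position and reads
them back in field order. It uses only the JDK: a Map<String, Object>
stands in for an Avro generic record (whose put and get by field name
it is meant to mimic), and a List<Object> stands in for a Cascading
tuple. TupleRecordSketch and its method names are hypothetical, and
none of this is the actual Avro or Cascading API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TupleRecordSketch {
    // Tuple -> record: pair each field name with the value at the same position.
    static Map<String, Object> toRecord(List<String> fieldNames, List<Object> tupleValues) {
        if (fieldNames.size() != tupleValues.size()) {
            throw new IllegalArgumentException("Field/value count mismatch");
        }
        Map<String, Object> record = new LinkedHashMap<String, Object>();
        for (int i = 0; i < fieldNames.size(); i++) {
            record.put(fieldNames.get(i), tupleValues.get(i));
        }
        return record;
    }

    // Record -> tuple: read values back in field order so positions stay stable.
    static List<Object> toTuple(List<String> fieldNames, Map<String, Object> record) {
        List<Object> values = new ArrayList<Object>();
        for (String name : fieldNames) {
            values.add(record.get(name));
        }
        return values;
    }
}
```

A real scheme would also have to convert value types on the way
through (for example Avro's own string representation versus
java.lang.String), which this sketch leaves out.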


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g