Avro >> mail # dev >> Questions re integrating Avro into Cascading process


Questions re integrating Avro into Cascading process
Hi all,

We're looking at creating a Cascading Scheme for Avro and have a few
questions below. They're very general, since this is more of a
scoping phase (as in, are we crazy to try this?), so apologies in
advance for the lack of detail.

For context, Cascading is an open source project that provides a  
workflow API on top of Hadoop. The key unit of data is a tuple, which  
corresponds to a record - you have fields (names) and values.  
Cascading uses a generalized "tap" concept for reading & writing  
tuples, where a tap uses a scheme to handle the low-level mapping from  
Cascading-land to/from the storage format.

So the goal here is to define a Cascading Scheme that will run on  
0.18.3 and later versions of Hadoop, and provide general support for  
reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.

We grabbed the recently committed AvroXXX code from  
org.apache.avro.mapred (thanks Doug & Scott), and began building the  
Cascading scheme to bridge between AvroWrapper<T> keys and Cascading  
tuples.

1. What's the best approach if we want to dynamically define the Avro  
schema, based on a list of field names and types (classes)?

This assumes it's possible to dynamically define & use a schema, of  
course.
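
To make question 1 concrete, here is a minimal sketch of one possible approach, not a confirmed Avro recipe. Avro schemas are JSON documents, so a record schema can be assembled as text from an ordered list of field names and Java classes using only the JDK. The class name TupleSchemaSketch, the record name passed in, and the class-to-type table are all assumptions made for illustration. The resulting string could presumably then be handed to Avro's schema parser (Schema.parse in the current API), but that step is not shown or tested here.

```java
import java.util.Map;

// Sketch only: builds the JSON text of an Avro record schema from
// field names and Java classes. All names here are hypothetical.
final class TupleSchemaSketch {

    // Assumed mapping from Java classes to Avro primitive type names.
    static String avroTypeFor(Class<?> type) {
        if (type == String.class) return "string";
        if (type == Integer.class) return "int";
        if (type == Long.class) return "long";
        if (type == Float.class) return "float";
        if (type == Double.class) return "double";
        if (type == Boolean.class) return "boolean";
        if (type == byte[].class) return "bytes";
        throw new IllegalArgumentException("No Avro mapping assumed for " + type.getName());
    }

    // Field order follows the map's iteration order, so pass an ordered map
    // (e.g. LinkedHashMap). Names are not escaped, on the assumption that
    // they are already valid Avro names (letters, digits, underscores).
    static String recordSchemaJson(String recordName, Map<String, Class<?>> fields) {
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"record\",\"name\":\"").append(recordName)
            .append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, Class<?>> field : fields.entrySet()) {
            if (!first) json.append(',');
            first = false;
            json.append("{\"name\":\"").append(field.getKey())
                .append("\",\"type\":\"").append(avroTypeFor(field.getValue()))
                .append("\"}");
        }
        return json.append("]}").toString();
    }
}
```

Every field in this sketch is non-nullable. If tuples can carry null values, each field type would likely need to become a union with "null" instead.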

2. How much has the new Hadoop map-reduce support code been tested?

3. Will there be issues with running on Hadoop 0.18.3, 0.19.2, etc.?

I saw some discussion about Hadoop using the older Jackson 1.0.1 jar,
and that causing problems. Anything else?

4. The key integration point, besides the fields+classes to schema
issue above, is mapping between Cascading tuples and AvroWrapper<T>.

If we're using (I assume) the generic format, any input on how we'd do  
this two-way conversion?
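
To illustrate the shape of the two-way conversion we have in mind, here is a rough sketch using plain JDK collections as stand-ins. A List<Object> plays the role of a Cascading tuple (positional values), and a Map<String, Object> plays the role of an Avro generic record (values addressed by field name). Neither is the real Cascading or Avro type, and TupleRecordSketch and its methods are hypothetical names. The only point is that an ordered list of field names is what bridges positions and names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: List<Object> stands in for a Cascading tuple and
// Map<String, Object> stands in for an Avro generic record.
final class TupleRecordSketch {

    // Tuple -> record: pair each positional value with its field name.
    static Map<String, Object> toRecord(List<String> fieldNames, List<Object> tupleValues) {
        if (fieldNames.size() != tupleValues.size()) {
            throw new IllegalArgumentException("Field count and value count differ");
        }
        Map<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            record.put(fieldNames.get(i), tupleValues.get(i));
        }
        return record;
    }

    // Record -> tuple: read values back out in field-name order.
    static List<Object> toTuple(List<String> fieldNames, Map<String, Object> record) {
        List<Object> values = new ArrayList<>(fieldNames.size());
        for (String name : fieldNames) {
            if (!record.containsKey(name)) {
                throw new IllegalArgumentException("Record has no field " + name);
            }
            values.add(record.get(name));
        }
        return values;
    }
}
```

In a real Scheme, the same loop structure would presumably wrap whatever accessors the Cascading tuple and Avro generic record classes actually provide, plus any per-type value conversion that turns out to be needed.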

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g