Questions re integrating Avro into Cascading process
Ken Krugler 2010-04-15, 17:33
Hi all,
We're looking at creating a Cascading Scheme for Avro, and have got a few questions below. These are very general, as this is more of a scoping phase (as in, are we crazy to try this), so apologies in advance for the lack of detail.

For context, Cascading is an open source project that provides a workflow API on top of Hadoop. The key unit of data is a tuple, which corresponds to a record: you have fields (names) and values. Cascading uses a generalized "tap" concept for reading & writing tuples, where a tap uses a scheme to handle the low-level mapping from Cascading-land to/from the storage format.

So the goal here is to define a Cascading Scheme that will run on 0.18.3 and later versions of Hadoop, and provide general support for reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.

We grabbed the recently committed AvroXXX code from org.apache.avro.mapred (thanks Doug & Scott), and began building the Cascading scheme to bridge between AvroWrapper<T> keys and Cascading tuples.

1. What's the best approach if we want to dynamically define the Avro schema, based on a list of field names and types (classes)? This assumes it's possible to dynamically define & use a schema, of course.

2. How much has the new Hadoop map-reduce support code been tested?

3. Will there be issues with running in 0.18.3, 0.19.2, etc? I saw some discussion about Hadoop using the older Jackson 1.0.1 jar, and that then creating problems. Anything else?

4. The key integration point, besides the fields+classes to schema issue above, is mapping between Cascading tuples and AvroWrapper<T>. If we're using (I assume) the generic format, any input on how we'd do this two-way conversion?

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
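To make question 1 concrete, here is a rough sketch of the kind of thing we have in mind: turning an ordered list of field names and Java classes into the JSON text of an Avro record schema. Everything named here (TupleSchemaSketch, avroTypeFor, buildRecordSchema, the record name "CascadingTuple", the sample fields) is made up for illustration and is not part of Avro or Cascading. It also assumes field names are already valid Avro names, covers only a handful of primitive types, and ignores null values. The resulting string would then presumably be handed to Avro's schema parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds Avro record-schema JSON from field names and classes.
public class TupleSchemaSketch {

    // Map a Java class to the name of an Avro primitive type.
    // Only a few classes are handled; anything else is rejected.
    public static String avroTypeFor(Class<?> cls) {
        if (cls == String.class) return "string";
        if (cls == Integer.class) return "int";
        if (cls == Long.class) return "long";
        if (cls == Float.class) return "float";
        if (cls == Double.class) return "double";
        if (cls == Boolean.class) return "boolean";
        if (cls == byte[].class) return "bytes";
        throw new IllegalArgumentException("No Avro mapping for " + cls.getName());
    }

    // Emit the schema JSON, keeping fields in the map's iteration order.
    // Assumes names need no JSON escaping (letters, digits, underscores).
    public static String buildRecordSchema(String recordName, Map<String, Class<?>> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"record\",\"name\":\"").append(recordName)
          .append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, Class<?>> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append("{\"name\":\"").append(e.getKey())
              .append("\",\"type\":\"").append(avroTypeFor(e.getValue()))
              .append("\"}");
        }
        sb.append("]}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, matching tuple field order.
        Map<String, Class<?>> fields = new LinkedHashMap<>();
        fields.put("url", String.class);
        fields.put("fetchTime", Long.class);
        System.out.println(buildRecordSchema("CascadingTuple", fields));
    }
}
```

Since tuple values can be null, a real version would probably need each field type to be a union with "null", which is part of what we're unsure about.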
Scott Carey 2010-04-16, 18:04
Ken Krugler 2010-04-16, 18:20
Scott Carey 2010-04-16, 18:28
Ken Krugler 2010-04-18, 14:49
Doug Cutting 2010-04-21, 22:22
Ken Krugler 2010-04-23, 04:40
Doug Cutting 2010-04-23, 19:33
Ken Krugler 2010-04-26, 19:59