Ken Krugler 2010-04-15, 17:33
Scott Carey 2010-04-16, 18:04
Ken Krugler 2010-04-16, 18:20
Scott Carey 2010-04-16, 18:28
> We're looking at creating a Cascading Scheme for Avro, and have got
> a few questions below. These are very general, as this is more of a
> scoping phase (as in, are we crazy to try this) so apologies in
> advance for lack of detail.
> For context, Cascading is an open source project that provides a
> workflow API on top of Hadoop. The key unit of data is a tuple,
> which corresponds to a record - you have fields (names) and values.
> Cascading uses a generalized "tap" concept for reading & writing
> tuples, where a tap uses a scheme to handle the low-level mapping
> from Cascading-land to/from the storage format.
> So the goal here is to define a Cascading Scheme that will run on
> 0.18.3 and later versions of Hadoop, and provide general support for
> reading/writing tuples from/to an Avro-format Hadoop part-xxxxx file.
> We grabbed the recently committed AvroXXX code from
> org.apache.avro.mapred (thanks Doug & Scott), and began building the
> Cascading scheme to bridge between AvroWrapper<T> keys and Cascading
An update on status - there's a working Cascading tap at http://github.com/bixolabs/cascading.avro. See the README (http://github.com/bixolabs/cascading.avro/blob/master/README) for more details.
One open issue - it would be great to be able to set metadata in the
headers of the resulting Avro files. But it wasn't obvious how to do
that, given our (intentionally) arms-length approach via the use of
the Avro mapred code.
One idea would be to have job conf values using keys prefixed with
avro.metadata.xxx, and the Avro mapred support could automagically use
those when creating the file. But this would break our goal of using
unmodified Avro source, so I'm curious whether support for setting the
file metadata would also be useful for the standard (Hadoop) use of
Avro for an output format, and if so, whether there was a better
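To make the avro.metadata.xxx idea above concrete, here is a minimal sketch of the key-filtering step only. It is not code from cascading.avro or from org.apache.avro.mapred. A plain Map stands in for the Hadoop job conf so the example has no dependencies, and the class name, method name, trailing dot on the prefix, and sample values are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a plain Map stands in for the Hadoop job conf.
class AvroMetadataSketch {
    // Prefix taken from the message above; the trailing dot is an assumption.
    static final String PREFIX = "avro.metadata.";

    // Collects every conf entry whose key starts with PREFIX and strips the
    // prefix, so "avro.metadata.creator" becomes the metadata key "creator".
    static Map<String, String> extractMetadata(Map<String, String> conf) {
        Map<String, String> meta = new TreeMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String key = e.getKey();
            // Skip unrelated keys and a bare prefix with no name after it.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                meta.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new TreeMap<>();
        conf.put("mapred.job.name", "example");               // unrelated key, ignored
        conf.put("avro.metadata.creator", "cascading.avro");  // hypothetical value
        Map<String, String> meta = extractMetadata(conf);
        // An output format could hand each pair to Avro's
        // DataFileWriter.setMeta(key, value) before calling create().
        meta.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Stripping the prefix keeps the resulting keys out of the "avro." namespace, which Avro reserves for its own file metadata.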
Doug Cutting 2010-04-21, 22:22
Ken Krugler 2010-04-23, 04:40
Doug Cutting 2010-04-23, 19:33
Ken Krugler 2010-04-26, 19:59