Re: Questions re integrating Avro into Cascading process
Ken Krugler 2010-04-23, 04:40
On Apr 21, 2010, at 3:22pm, Doug Cutting wrote:
> Ken Krugler wrote:
>> One open issue - it would be great to be able to set metadata in
>> the headers of the resulting Avro files. But it wasn't obvious how
>> to do that, given our (intentionally) arms-length approach via the
>> use of the Avro mapred code.
>> One idea would be to have job conf values using keys prefixed with
>> avro.metadata.xxx, and the Avro mapred support could automagically
>> use that when creating the file. But this would break our goal of
>> using unmodified Avro source, so I'm curious whether support for
>> setting the file metadata would also be useful for the standard
>> (Hadoop) use of Avro for an output format, and if so, whether there
>> was a better approach.
> Embedding the metadata in the configuration seems like a good
> approach. Please file a Jira issue for this and attach a patch.
> AvroOutputFormat can add properties named
> avro.mapred.output.metadata.*. We'll have to enumerate all
> properties in the job and test for this prefix, since Configuration
> is a HashMap, but the alternative of encoding the metadata map in a
> single configuration value seems no more attractive.
> Note that https://issues.apache.org/jira/browse/HADOOP-6420 added
> support for adding maps to configuration, but the extracted map
> cannot be enumerated, so could not be added to the DataFileWriter's
> metadata. Also, this feature is perhaps slated for removal as a part
> of https://issues.apache.org/jira/browse/HADOOP-6698, but its code
> might prove useful as a starting point.
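
For reference, the prefix scan described above could look roughly like the sketch below. It is only an illustration: the class and method names are hypothetical, and a plain map stands in for the job configuration so the example runs without Hadoop on the classpath. Hadoop's Configuration can be iterated as Map.Entry<String, String> pairs, so the same loop should carry over. Stripping the prefix before storing the key is an assumption on my part, not something settled in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Avro or Hadoop.
class MetadataPrefixScan {
    // Prefix suggested in the thread for metadata carried in the job conf.
    static final String PREFIX = "avro.mapred.output.metadata.";

    // Walks every property and keeps those under PREFIX, with the prefix
    // stripped (assumption: the file metadata key is the remainder).
    static Map<String, String> extract(Iterable<Map.Entry<String, String>> props) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : props) {
            String key = e.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                meta.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Stand-in for the job configuration.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapred.job.name", "crawl");
        conf.put(PREFIX + "crawl.source", "bixo");

        // In an output format, each extracted pair would be passed to
        // DataFileWriter.setMeta(key, value) before the file is created.
        System.out.println(extract(conf.entrySet()));
    }
}
```

One thing to watch: DataFileWriter reserves metadata keys that begin with "avro.", so the stripped key should not start with that prefix.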
Thanks for the info, we'll work up a patch & file the issue when it's ready.
Two related questions:
1. I'm assuming there's no compelling reason to read the file headers
- in fact, not sure how you'd even get at the data, much less how
you'd deal with potentially partial/missing data from a set of Avro
files being read as part files.
2. We'd like to not include Avro source in the Cascading scheme
project, but rather just have a dependency on the Avro jar.
We have a similar relationship between Bixo and Tika, and what's
worked well is for the Bixo master branch to have a dependency on the
Tika snapshot builds, so we can quickly iterate on both projects.
So are there plans to start pushing Avro snapshot builds to the Apache
snapshots repository? I see occasional Avro releases to the Maven
central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.