Re: Questions re integrating Avro into Cascading process
Ken Krugler 2010-04-23, 04:40
On Apr 21, 2010, at 3:22pm, Doug Cutting wrote:

> Ken Krugler wrote:
>> One open issue - it would be great to be able to set metadata in
>> the headers of the resulting Avro files. But it wasn't obvious how
>> to do that, given our (intentionally) arms-length approach via the
>> use of the Avro mapred code.
>>
>> One idea would be to have job conf values using keys prefixed with
>> avro.metadata.xxx, and the Avro mapred support could automagically
>> use that when creating the file. But this would break our goal of
>> using unmodified Avro source, so I'm curious whether support for
>> setting the file metadata would also be useful for the standard
>> (Hadoop) use of Avro for an output format, and if so, whether there
>> was a better approach.
>
> Embedding the metadata in the configuration seems like a good
> approach. Please file a Jira issue for this and attach a patch.
>
> AvroOutputFormat can add properties named
> avro.mapred.output.metadata.*. We'll have to enumerate all
> properties in the job and test for this prefix, since Configuration
> is a HashMap, but the alternative of encoding the metadata map in a
> single configuration value seems no more attractive.
>
> Note that https://issues.apache.org/jira/browse/HADOOP-6420 added
> support for adding maps to configuration, but the extracted map
> cannot be enumerated, so could not be added to the DataFileWriter's
> metadata. Also, this feature is perhaps slated for removal as a part
> of https://issues.apache.org/jira/browse/HADOOP-6698, but its code
> might prove useful as a starting point.

Thanks for the info, we'll work up a patch & file the issue when it's ready.

Two related questions:

1. I'm assuming there's no compelling reason to read the file headers - in fact, not sure how you'd even get at the data, much less how you'd deal with potentially partial/missing data from a set of Avro files being read as part files.

2. We'd like to not include Avro source in the Cascading scheme project, but rather just have a dependency on the Avro jar. We have a similar relationship between Bixo and Tika, and what's worked well is for the Bixo master branch to have a dependency on the Tika snapshot builds, so we can quickly iterate on both projects. So are there plans to start pushing Avro snapshot builds to the Apache snapshots repository? I see occasional Avro releases to the Maven central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
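
For reference, here is a rough sketch of the prefix scan described in the quoted reply: enumerate every job property, keep those starting with avro.mapred.output.metadata., and strip the prefix to get the metadata key. This is only an illustration under assumptions, not the eventual patch. The class and method names are hypothetical, and a plain Map stands in for the Hadoop Configuration (which can be iterated as key/value entries) so the example runs without Hadoop or Avro on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Avro or Hadoop.
class MetadataPrefixSketch {
    // Prefix proposed in the thread. The trailing dot is part of the match.
    static final String PREFIX = "avro.mapred.output.metadata.";

    // Scans all entries and returns those under PREFIX, with the prefix removed.
    // The Iterable parameter is a stand-in for iterating a job configuration.
    static Map<String, String> extractMetadata(Iterable<Map.Entry<String, String>> conf) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf) {
            String key = entry.getKey();
            // Skip unrelated properties and a bare prefix with no key after it.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                meta.put(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Made-up property names and values, for illustration only.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapred.job.name", "example-job");
        conf.put(PREFIX + "creator", "cascading-avro-scheme");
        System.out.println(extractMetadata(conf.entrySet()));
    }
}
```

In an actual output format, each extracted pair would presumably be handed to the DataFileWriter's metadata (for example via its setMeta call) before the file is created; that wiring is left out here.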