Avro >> mail # dev >> Questions re integrating Avro into Cascading process


Re: Questions re integrating Avro into Cascading process

On Apr 21, 2010, at 3:22pm, Doug Cutting wrote:

> Ken Krugler wrote:
>> One open issue - it would be great to be able to set metadata in  
>> the headers of the resulting Avro files. But it wasn't obvious how  
>> to do that, given our (intentionally) arms-length approach via the  
>> use of the Avro mapred code.
>> One idea would be to have job conf values using keys prefixed with  
>> avro.metadata.xxx, and the Avro mapred support could automagically  
>> use that when creating the file. But this would break our goal of  
>> using unmodified Avro source, so I'm curious whether support for  
>> setting the file metadata would also be useful for the standard  
>> (Hadoop) use of Avro for an output format, and if so, whether there  
>> was a better approach.
>
> Embedding the metadata in the configuration seems like a good  
> approach.  Please file a Jira issue for this and attach a patch.
>
> AvroOutputFormat can add properties named  
> avro.mapred.output.metadata.*.  We'll have to enumerate all  
> properties in the job and test for this prefix, since Configuration  
> is a HashMap, but the alternative of encoding the metadata map in a  
> single configuration value seems no more attractive.
>
> Note that https://issues.apache.org/jira/browse/HADOOP-6420 added  
> support for adding maps to configuration, but the extracted map  
> cannot be enumerated, so could not be added to the DataFileWriter's  
> metadata. Also, this feature is perhaps slated for removal as a part  
> of https://issues.apache.org/jira/browse/HADOOP-6698, but its code  
> might prove useful as a starting point.

Thanks for the info, we'll work up a patch & file the issue when it's  
ready.
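
As a rough sketch of the prefix scan discussed above (an illustration only, not the eventual patch): the class name AvroMetadataSketch and the method extractMetadata are made up for this example, and a plain java.util.Map stands in for the job configuration so the snippet runs without Hadoop or Avro on the classpath. The prefix is the one suggested in the quoted reply.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Avro or Hadoop.
class AvroMetadataSketch {
    // Prefix proposed in this thread for metadata-carrying job properties.
    static final String PREFIX = "avro.mapred.output.metadata.";

    // Walks every property and keeps those under PREFIX, with the prefix
    // stripped. Hadoop's Configuration is itself an
    // Iterable<Map.Entry<String, String>>, so it could be passed here directly.
    static Map<String, String> extractMetadata(Iterable<Map.Entry<String, String>> conf) {
        Map<String, String> meta = new TreeMap<>();
        for (Map.Entry<String, String> e : conf) {
            String key = e.getKey();
            // Skip a bare prefix with nothing after it.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                meta.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Stand-in for a job configuration.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapred.job.name", "crawl");
        conf.put(PREFIX + "source", "bixo");
        System.out.println(extractMetadata(conf.entrySet()));
    }
}
```

In the output format, each extracted pair would presumably then be handed to DataFileWriter.setMeta(String, String) before the file is created; that wiring is left out here.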

Two related questions:

1. I'm assuming there's no compelling reason to read the file headers  
back in. In fact, I'm not sure how you'd even get at that metadata, much  
less how you'd handle partial or missing values when a set of Avro  
files is being read as part files. Is that a fair assumption?

2. We'd like to not include Avro source in the Cascading scheme  
project, but rather just have a dependency on the Avro jar.

We have a similar relationship between Bixo and Tika, and what's  
worked well is for the Bixo master branch to have a dependency on the  
Tika snapshot builds, so we can quickly iterate on both projects.

So are there plans to start pushing Avro snapshot builds to the Apache  
snapshots repository? I see occasional Avro releases to the Maven  
central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g