Avro >> mail # dev >> Questions re integrating Avro into Cascading process


Doug Cutting 2010-04-21, 22:22
Re: Questions re integrating Avro into Cascading process

On Apr 21, 2010, at 3:22pm, Doug Cutting wrote:

> Ken Krugler wrote:
>> One open issue - it would be great to be able to set metadata in  
>> the headers of the resulting Avro files. But it wasn't obvious how  
>> to do that, given our (intentionally) arms-length approach via the  
>> use of the Avro mapred code.
>> One idea would be to have job conf values using keys prefixed with  
>> avro.metadata.xxx, and the Avro mapred support could automagically  
>> use that when creating the file. But this would break our goal of  
>> using unmodified Avro source, so I'm curious whether support for  
>> setting the file metadata would also be useful for the standard  
>> (Hadoop) use of Avro for an output format, and if so, whether there  
>> was a better approach.
>
> Embedding the metadata in the configuration seems like a good  
> approach.  Please file a Jira issue for this and attach a patch.
>
> AvroOutputFormat can add properties named  
> avro.mapred.output.metadata.*.  We'll have to enumerate all  
> properties in the job and test for this prefix, since Configuration  
> is a HashMap, but the alternative of encoding the metadata map in a  
> single configuration value seems no more attractive.
>
> Note that https://issues.apache.org/jira/browse/HADOOP-6420 added  
> support for adding maps to configuration, but the extracted map  
> cannot be enumerated, so could not be added to the DataFileWriter's  
> metadata. Also, this feature is perhaps slated for removal as a part  
> of https://issues.apache.org/jira/browse/HADOOP-6698, but its code  
> might prove useful as a starting point.

Thanks for the info, we'll work up a patch & file the issue when it's  
ready.
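
As a rough sketch of the prefix-scanning idea above (not the eventual patch): the code below walks every property, keeps those whose key starts with avro.mapred.output.metadata., and strips the prefix to get the metadata key. The class and method names are hypothetical, and a plain Map stands in for the job configuration so the example runs without Hadoop or Avro on the classpath. In real code the loop would iterate over the job's Configuration, and each extracted pair would be handed to the DataFileWriter before the file is created.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; names are illustrative, not part of Avro or Hadoop.
public class MetadataFromConf {
    // Prefix suggested in the thread for metadata carried in the job conf.
    public static final String PREFIX = "avro.mapred.output.metadata.";

    // Enumerate all properties and keep those under PREFIX, with the
    // prefix stripped. A bare prefix with no key after it is skipped.
    public static Map<String, String> extractMetadata(
            Iterable<Map.Entry<String, String>> conf) {
        Map<String, String> meta = new TreeMap<String, String>();
        for (Map.Entry<String, String> e : conf) {
            String key = e.getKey();
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                meta.put(key.substring(PREFIX.length()), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Stand-in for the job configuration (assumption: plain string pairs).
        Map<String, String> conf = new LinkedHashMap<String, String>();
        conf.put("mapred.job.name", "crawl");
        conf.put(PREFIX + "source", "bixo");
        conf.put(PREFIX + "run.id", "42");

        // In the output format, each pair would be passed to the file
        // writer's metadata instead of being printed.
        for (Map.Entry<String, String> e : extractMetadata(conf.entrySet()).entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

One design point to settle in the patch: metadata keys that themselves begin with "avro." are reserved for Avro's own use, so user-supplied keys under the prefix should probably be rejected or skipped if they collide with that namespace.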

Two related questions:

1. I'm assuming there's no compelling reason to read the file headers  
on input. In fact, I'm not sure how you'd even get at that metadata,  
much less how you'd handle partial or missing metadata when a set of  
Avro files is read as part files.

2. We'd like to not include Avro source in the Cascading scheme  
project, but rather just have a dependency on the Avro jar.

We have a similar relationship between Bixo and Tika, and what's  
worked well is for the Bixo master branch to have a dependency on the  
Tika snapshot builds, so we can quickly iterate on both projects.

So are there plans to start pushing Avro snapshot builds to the Apache  
snapshots repository? I see occasional Avro releases to the Maven  
central repo (1.0, 1.2, 1.3.2) but nothing for snapshots.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g