Re: Q about use of avro logs with hadoop streaming
Doug Cutting 2010-04-06, 18:32
Mona Gandhi wrote:
> i currently use avro version 1.3.0 to log data. I am having difficulty processing these avro logs via a map reduce job written in Python using hadoop streaming(v 0.21.0).

Mona,

There is currently no support for Avro data in streaming. One could use a shell command to convert Avro data to lines of text (e.g., Avro's 'tojson' tool) but that would be rather inefficient.

A good approach would be something akin to Hadoop Pipes: we implement a Java mapper and reducer that use an Avro protocol to communicate with a subprocess over standard input and output, transmitting input and output records as raw binary. The subprocess would deserialize inputs, call the user-provided mapper or reducer function, then serialize outputs back. This would require no changes to Hadoop and could be included in Avro. We'd provide implementations of this protocol for the various languages, Python, Ruby, C, C++, etc., enabling high-performance mapreduce programs over Avro data for all of these.

The existing Hadoop Pipes implementation would be a good starting point for this work, as it already uses the same technique, although with a Hadoop Writable-based protocol and with only a C++ implementation.

I've filed an issue in Jira to track this:

https://issues.apache.org/jira/browse/AVRO-512

I might have a chance to work on this later this month.

Doug
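
For illustration only, the subprocess side of the approach described above might look roughly like the Python sketch below. Nothing here is the AVRO-512 design: the length-prefixed framing, the names read_frames, write_frame and run_mapper, and the decode/encode callbacks are all hypothetical. The callbacks stand in for whatever Avro binary decoder and encoder the real protocol would use.

```python
import struct
import sys

# Assumed framing (not part of any Avro spec): each record is sent as a
# 4-byte big-endian length followed by that many bytes of payload.
_HEADER = struct.Struct(">I")


def read_frames(stream):
    """Yield raw record payloads from a binary stream until EOF."""
    while True:
        header = stream.read(_HEADER.size)
        if not header:
            return  # clean end of input
        if len(header) < _HEADER.size:
            raise EOFError("truncated frame header")
        (length,) = _HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record")
        yield payload


def write_frame(stream, payload):
    """Write one length-prefixed record to a binary stream."""
    stream.write(_HEADER.pack(len(payload)))
    stream.write(payload)


def run_mapper(map_fn, decode, encode, stdin, stdout):
    """Deserialize each input, call the user's mapper, serialize its outputs.

    decode and encode are placeholders for an Avro datum reader and writer.
    map_fn takes one decoded record and returns an iterable of outputs.
    """
    for payload in read_frames(stdin):
        for result in map_fn(decode(payload)):
            write_frame(stdout, encode(result))
    stdout.flush()


# Hypothetical use from a script launched by the parent Java task:
#   run_mapper(my_map, my_decode, my_encode,
#              sys.stdin.buffer, sys.stdout.buffer)
```

The binary streams (sys.stdin.buffer and sys.stdout.buffer) are used rather than the text ones so that record bytes pass through untouched, which is the point of avoiding a line-of-text conversion.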