Mona Gandhi wrote:
> I currently use Avro version 1.3.0 to log data. I am having difficulty processing these Avro logs via a MapReduce job written in Python using Hadoop streaming (v 0.21.0).
There is currently no support for Avro data in streaming. One could use
a shell command to convert Avro data to lines of text (e.g., Avro's
'tojson' tool), but that would be rather inefficient.
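For illustration, the 'tojson' workaround might look something like the
following (jar and file names here are hypothetical; avro-tools provides a
'tojson' command that prints one JSON record per line, which streaming can
then consume as ordinary text):

```shell
# Dump the Avro file to one JSON record per line (jar path illustrative).
java -jar avro-tools-1.3.0.jar tojson input.avro > input.json

# Feed the resulting text to a streaming job as usual.
hadoop jar hadoop-streaming.jar \
  -input input.json -output out \
  -mapper mapper.py -reducer reducer.py
```

The extra serialize/parse round trip through JSON text is where the
inefficiency comes from.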
A good approach would be something akin to Hadoop Pipes: we implement a
Java mapper and reducer that use an Avro protocol to communicate with a
subprocess over standard input and output, transmitting input and output
records as raw binary. The subprocess would deserialize inputs, call
the user-provided mapper or reducer function, then serialize outputs
back. This would require no changes to Hadoop and could be included in
Avro. We'd provide implementations of this protocol for the various
languages (Python, Ruby, C, C++, etc.), enabling high-performance
MapReduce programs over Avro data in all of these.
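A minimal sketch of the subprocess side might look like the following. The
framing (a 4-byte length prefix per record) and the function names are my
own assumptions, not a defined protocol; a real implementation would decode
each record body with avro.io.DatumReader and re-encode outputs with
avro.io.DatumWriter, whereas here the bytes are passed through untouched to
keep the sketch self-contained:

```python
import io
import struct

def read_records(stream):
    """Yield raw record bodies from a length-prefixed binary stream.

    Each record is assumed to arrive as a 4-byte big-endian length
    followed by that many bytes of Avro-encoded data.
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of input
        (size,) = struct.unpack(">I", header)
        yield stream.read(size)

def write_record(stream, body):
    """Write one record body with its length prefix."""
    stream.write(struct.pack(">I", len(body)))
    stream.write(body)

def run_mapper(mapper, stdin, stdout):
    """Drive a user-provided mapper over the framed record stream."""
    for body in read_records(stdin):
        for out in mapper(body):  # mapper yields zero or more outputs
            write_record(stdout, out)

# Example with in-memory streams: a mapper that echoes each record
# (a stand-in for real deserialize/process/serialize logic).
inp = io.BytesIO()
for rec in (b"alpha", b"beta"):
    write_record(inp, rec)
inp.seek(0)
out = io.BytesIO()
run_mapper(lambda r: [r], inp, out)
out.seek(0)
print(list(read_records(out)))  # -> [b'alpha', b'beta']
```

Because only raw bytes cross the pipe, the Java side never needs to
understand the records, which is what keeps the approach language-neutral.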
The existing Hadoop Pipes implementation would be a good starting point
for this work, as it already uses the same technique, albeit with a
Hadoop Writable-based protocol and only a C++ implementation.
I've filed an issue in Jira to track this:
I might have a chance to work on this later this month.