Avro >> mail # user >> Getting started with Avro + Reading from an Avro formatted file


Re: Getting started with Avro + Reading from an Avro formatted file
> I want to be able to read from an Avro formatted log file (specifically the History Log file created at the end of a Hadoop job) and create a Comma Separated file of certain log entries. I need a csv file because this is the format that is accepted by post processing software I am working with (eg: Matlab).
>
> Initially I was using a BASH script to grep and awk from this file and create my CSV file, because I needed only a few values from it and a quick script just worked. I didn't try to find out what format the log file was actually in and make use of that. (my bad!)  Now that I need to scale up and want a reliable way to parse, I would like to try and do it the right way.
>
> My question is this: For the above goal, could you please guide me with steps I can follow - such as reading material and libraries I could try to use. As I go through the Quick Start Guide and FAQ, I see that a lot of the information here is geared to someone who wants to use the data serialization and RPC functionality provided by Avro. Given that I only want to be able to "read", where may I start?
>
> I can comfortably script with BASH and Perl. Given that I only see support for Java, Python and Ruby, I think I can take this as an opportunity to learn Python and get up to speed.

You could also take a look at the C bindings.  We've recently added a couple of command-line tools for outputting the contents of an Avro file to stdout: avrocat and avropipe.  avrocat outputs each record in an Avro file on a single line, using the JSON encoding defined by the Avro spec [1].  avropipe produces a separate line for each “field” in each record; its output is (roughly speaking) what you'd get from piping the JSON encoding of each record through the jsonpipe [2] tool.  (Technically speaking, it's what you get from putting all of the records into a JSON array, and sending that array through jsonpipe.)

[1] http://avro.apache.org/docs/current/spec.html#json_encoding
[2] https://github.com/dvxhouse/jsonpipe

So, with the example quickstop.db file, avrocat gives you:

  $ avrocat examples/quickstop.db | head
  {"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}
  {"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 3, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 4, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
  {"ID": 5, "First": "Bob", "Last": "Silent", "Phone": "(555) 123-6422", "Age": 29}
  {"ID": 6, "First": "Jay", "Last": "???", "Phone": "(0)", "Age": 26}
  {"ID": 7, "First": "Dante", "Last": "Hicks", "Phone": "(1)", "Age": 32}
  {"ID": 8, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 9, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 10, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
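Since your end goal is a CSV file, that one-JSON-record-per-line output is easy to post-process. As a rough sketch (not part of the Avro tooling itself; the script name in the comment is just a placeholder), a few lines of standard-library Python will turn avrocat's output into CSV:

```python
# Rough sketch: convert avrocat's one-JSON-record-per-line output
# into CSV using only the Python standard library.
# Intended usage (the script name "avro2csv.py" is hypothetical):
#   avrocat examples/quickstop.db | python avro2csv.py > quickstop.csv
import csv
import json

def json_lines_to_csv(lines, out):
    """Write one CSV row per JSON object, with a header row taken
    from the first record's field names."""
    writer = None
    for line in lines:
        record = json.loads(line)
        if writer is None:
            # First record seen: emit the header from its field names.
            writer = csv.DictWriter(out, fieldnames=list(record))
            writer.writeheader()
        writer.writerow(record)
```

Hook it up to stdin/stdout (e.g. `json_lines_to_csv(sys.stdin, sys.stdout)`) and you can pipe avrocat straight into it, then keep or drop columns however you like.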

While avropipe gives you:

  $ avropipe examples/quickstop.db | head -n 25
  / []
  /0 {}
  /0/ID 1
  /0/First "Dante\u0000"
  /0/Last "Hicks\u0000"
  /0/Phone "(0)\u0000"
  /0/Age 32
  /1 {}
  /1/ID 2
  /1/First "Randal\u0000"
  /1/Last "Graves\u0000"
  /1/Phone "(555) 123-5678\u0000"
  /1/Age 30
  /2 {}
  /2/ID 3
  /2/First "Veronica\u0000"
  /2/Last "Loughran\u0000"
  /2/Phone "(555) 123-0987\u0000"
  /2/Age 28
  /3 {}
  /3/ID 4
  /3/First "Caitlin\u0000"
  /3/Last "Bree\u0000"
  /3/Phone "(555) 123-2323\u0000"
  /3/Age 27
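For reference, that path/value flattening is straightforward to reproduce yourself once you have the decoded records. Here's a rough stdlib-only sketch of the jsonpipe-style transformation (my own illustration, not the actual avropipe implementation): containers print as `{}` or `[]`, and each leaf value prints on its own line under a `/`-separated path:

```python
# Sketch of the jsonpipe-style flattening that avropipe performs:
# every node of a JSON structure becomes one "path value" line.
import json

def jsonpipe_lines(value, path=""):
    if isinstance(value, dict):
        # Objects print as "{}", then one line per key underneath.
        yield "%s {}" % (path or "/")
        for key, child in value.items():
            yield from jsonpipe_lines(child, "%s/%s" % (path, key))
    elif isinstance(value, list):
        # Arrays print as "[]", with numeric indices as path segments.
        yield "%s []" % (path or "/")
        for i, child in enumerate(value):
            yield from jsonpipe_lines(child, "%s/%d" % (path, i))
    else:
        # Leaves print as their JSON encoding.
        yield "%s %s" % (path or "/", json.dumps(value))
```

For example, `list(jsonpipe_lines([{"ID": 1}]))` gives `["/ []", "/0 {}", "/0/ID 1"]`, matching the shape of the avropipe output above.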

Although I'm seeing a bug there, since those NUL terminators shouldn't appear in the output.  I'm going to open a ticket for that and fix it real quick.  But, these tools might be exactly what you need, especially since the C bindings don't have any library dependencies to install.

cheers
–doug