Avro >> mail # user >> Getting started with Avro + Reading from an Avro formatted file

selvi k 2012-01-24, 15:31
Re: Getting started with Avro + Reading from an Avro formatted file
> I want to be able to read from an Avro formatted log file (specifically the History Log file created at the end of a Hadoop job) and create a comma-separated file of certain log entries. I need a CSV file because this is the format accepted by the post-processing software I am working with (e.g., Matlab).
> Initially I was using a BASH script to grep and awk from this file and create my CSV file, because I needed very few values from it and a quick script just worked. I didn't try to find out what format the log file was in and make use of that. (My bad!) Now that I need to scale up and want a reliable way to parse, I would like to try to do it the right way.
> My question is this: For the above goal, could you please guide me with steps I can follow - such as reading material and libraries I could try to use. As I go through the Quick Start Guide and FAQ, I see that a lot of the information here is geared to someone who wants to use the data serialization and RPC functionality provided by Avro. Given that I only want to be able to "read", where may I start?
> I can comfortably script with BASH and Perl. Given that I only see support for Java, Python and Ruby, I think I can take this as an opportunity to learn Python and get up to speed.

You could also take a look at the C bindings.  We've recently added a couple of command-line tools for outputting the contents of an Avro file to stdout: avrocat and avropipe.  avrocat outputs each record in an Avro file on a single line, using the JSON encoding defined by the Avro spec [1].  avropipe produces a separate line for each “field” in each record; its output is (roughly speaking) what you'd get from piping the JSON encoding of each record through the jsonpipe [2] tool.  (Technically speaking, it's what you get from putting all of the records into a JSON array, and sending that array through jsonpipe.)

[1] http://avro.apache.org/docs/current/spec.html#json_encoding
[2] https://github.com/dvxhouse/jsonpipe

So, with the example quickstop.db file, avrocat gives you:

  $ avrocat examples/quickstop.db | head
  {"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}
  {"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 3, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 4, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}
  {"ID": 5, "First": "Bob", "Last": "Silent", "Phone": "(555) 123-6422", "Age": 29}
  {"ID": 6, "First": "Jay", "Last": "???", "Phone": "(0)", "Age": 26}
  {"ID": 7, "First": "Dante", "Last": "Hicks", "Phone": "(1)", "Age": 32}
  {"ID": 8, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}
  {"ID": 9, "First": "Veronica", "Last": "Loughran", "Phone": "(555) 123-0987", "Age": 28}
  {"ID": 10, "First": "Caitlin", "Last": "Bree", "Phone": "(555) 123-2323", "Age": 27}

While avropipe gives you:

  $ avropipe examples/quickstop.db | head -n 25
  / []
  /0 {}
  /0/ID 1
  /0/First "Dante\u0000"
  /0/Last "Hicks\u0000"
  /0/Phone "(0)\u0000"
  /0/Age 32
  /1 {}
  /1/ID 2
  /1/First "Randal\u0000"
  /1/Last "Graves\u0000"
  /1/Phone "(555) 123-5678\u0000"
  /1/Age 30
  /2 {}
  /2/ID 3
  /2/First "Veronica\u0000"
  /2/Last "Loughran\u0000"
  /2/Phone "(555) 123-0987\u0000"
  /2/Age 28
  /3 {}
  /3/ID 4
  /3/First "Caitlin\u0000"
  /3/Last "Bree\u0000"
  /3/Phone "(555) 123-2323\u0000"
  /3/Age 27

I'm seeing a bug there, though: those NUL terminators shouldn't appear in the output. I'm going to open a ticket for that and fix it real quick. Still, these tools might be exactly what you need, especially since the C bindings don't have any library dependencies to install.
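
Since avrocat writes one JSON object per line, getting from there to CSV should only take a few lines of Python. Here is a minimal sketch, not a tested recipe: the function name json_lines_to_csv is made up for illustration, the field names are the ones from the quickstop example above, and it assumes every field you pick holds a plain scalar (nested records or union values in the Avro JSON encoding would need flattening first).

```python
import csv
import io
import json


def json_lines_to_csv(lines, fields, out):
    """Write one CSV row per JSON record, keeping only the named fields.

    `lines` is any iterable of strings, one JSON object per line
    (the shape of avrocat's output). Missing fields become empty cells.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)  # header row
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        writer.writerow([record.get(name, "") for name in fields])


# Small demonstration using two records copied from the avrocat output above.
sample = [
    '{"ID": 1, "First": "Dante", "Last": "Hicks", "Phone": "(0)", "Age": 32}',
    '{"ID": 2, "First": "Randal", "Last": "Graves", "Phone": "(555) 123-5678", "Age": 30}',
]
buffer = io.StringIO()
json_lines_to_csv(sample, ["ID", "Last", "Age"], buffer)
print(buffer.getvalue(), end="")
```

To use it on a real file, you would pass sys.stdin as `lines` and sys.stdout as `out`, and pipe the avrocat output into the script.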
