Avro >> mail # user >> Go library


Re: Go library
Cool.  Thanks for the response.

Quick update:

I've had early success reading Avro files with the Avro C library and
Go through cgo.  It was relatively straightforward.  It's a tad
tedious because the new "value" interface in the C library uses a lot
of macros, and cgo cannot (AFAIK) call macros directly.  Instead, I
needed to create C wrapper functions for the macros.  I did this for
about 8 of them (just the ones I needed as a proof of concept, but
that covered most everything you'd expect on the reading side:
generic readers, retrieving the writer schema, iterating over record
values, teasing out union/discriminant branches, retrieving string and
long values, getting a field by index and by name, and the
corresponding incref/decref calls).  Aside from the macros,
integrating with C from Go is straightforward and, in some quick
tests, seems comparable in performance to C.

I tested performance using a simple script that reads through an Avro
file, extracts two fields (a string and a long), and sums the longs
across all records (the strings are just dropped on the floor).  I
tested with a ~900M Avro file (compressed blocks) containing about 25M
records.  On my machine, the simple C program I built runs through it
in about 42 seconds.  The Go library that does essentially the same
thing through cgo accomplishes the task in about 51 seconds.  A more
common (in my domain) sized input (a ~270M Avro file containing ~7.5M
records) runs in ~15s in C and ~18s in Go.  We regularly process
hundreds of files of that size/shape.  This isn't taking advantage of
any of Go's concurrency facilities, and the Go code is largely just
the C code in Go clothing, but I was pleased to see pretty negligible
overhead.
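The benchmark's inner loop amounts to the sketch below.  The record iteration is stubbed with an in-memory slice, since the actual per-record cgo calls into avro-c aren't shown in the message; only the extract-two-fields-and-sum shape is from the description above.

```go
package main

import "fmt"

// record stands in for one decoded Avro record; in the real test the
// string and long fields come out of the avro-c value interface.
type record struct {
	name  string // extracted, then dropped on the floor
	count int64  // accumulated across all records
}

// sumLongs mirrors the benchmark: visit every record, read both
// fields, discard the string, and sum the longs.
func sumLongs(recs []record) int64 {
	var total int64
	for _, r := range recs {
		_ = r.name
		total += r.count
	}
	return total
}

func main() {
	recs := []record{{"a", 1}, {"b", 2}, {"c", 3}}
	fmt.Println(sumLongs(recs)) // prints 6
}
```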

Looking down the road, an idiomatic library should follow a pattern
similar to Go's "encoding/json" package.  That shouldn't be too
difficult; the only real barrier is time ;-)  I currently have a task
at hand and have enough pieces to accomplish it, but I will circle
back on this as I get more comfortable with Go idioms and
idiosyncrasies.

I wanted to share the above though as I view these quick results as promising.

P.S. I also tested using C to convert a record to a JSON char* and
passing that to a Go function that unmarshals it into a Go struct.
This worked fine but, as one would expect, adds a considerable amount
of overhead: 12 minutes for the same ~52-second test noted above.  It
does work, though, as a quick approach.

On Mar 20, 2014 4:33 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
