Anyone know of any Avro libraries for Go? I haven't had much luck finding anything. Either cgo or pure Go is fine by me. I'm a long-time user of Avro and have a considerable amount of data in it. (Avro is our serialization format of choice for all archive data, event logs, and other data stored on S3 and in HDFS.) Go is quickly becoming a core technology in our stack as well, and the lack of Avro support is one of the impediments to wider adoption.
Worst-case scenario, this may be something I take on. I'd much rather pick up where someone else left off, though. I don't need any RPC functionality, just read/write (with compression support).
I've had early success reading Avro files with the Avro C library and Go through cgo. It was relatively straightforward. It's a tad tedious, as the new "value" interface on the C library uses a lot of macros, and cgo cannot (AFAIK) call macros directly. Instead, I needed to create C wrapper functions for all the macros. I did this for about 8 or so macros (just the ones I needed as a proof of concept, but that covered most everything you'd expect on the reading side: generic readers, retrieving the writer schema, iterating over record values, teasing out union/discriminant branches, retrieving string and long values, getting a field by index and by name, and the corresponding incref/decref). Aside from the macros, integrating with C from Go is straightforward and, per some quick tests, seems comparable in performance to C.
I tested performance using a simple script that reads through an Avro file, extracts two fields (string and long), and sums the longs across all records (the strings are just dropped on the floor). I tested with a ~900M Avro file (compressed blocks) containing about 25M records. On my machine, the simple C program I built runs through it in about 42 seconds. The Go library that does essentially the same thing via cgo accomplishes the task in about 51 seconds. A more common (in my domain) input size (~270M Avro file, ~7.5M records) runs in ~15s in C and ~18s in Go. We regularly process hundreds of files of that size/shape. This is without taking advantage of any Go concurrency, and the Go code is largely just the C code in Go clothing, but I was pleased to see pretty negligible overhead.
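Since the benchmark above is single-threaded and the workload is hundreds of independent files, an obvious next step is fanning files out across goroutines. A hedged sketch, where `processFile` is a placeholder for the cgo-backed per-file reader (here it just returns a canned value so the example runs standalone):

```go
package main

import (
	"fmt"
	"sync"
)

// processFile stands in for the cgo-backed reader that sums the long
// field of one Avro file; here it returns placeholder work.
func processFile(path string) int64 {
	return int64(len(path))
}

// sumFiles fans paths out across nWorkers goroutines and totals the
// per-file sums, the concurrency the benchmark above does not yet use.
func sumFiles(paths []string, nWorkers int) int64 {
	jobs := make(chan string)
	var (
		total int64
		mu    sync.Mutex
		wg    sync.WaitGroup
	)
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				n := processFile(p)
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(sumFiles([]string{"a.avro", "bb.avro"}, 2)) // prints 13
}
```

Each worker holds its own file handle, so no locking is needed around the C library itself, only around the shared total.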
Looking down the road, an idiomatic library should follow a pattern similar to the Go "encoding/json" package. That shouldn't be too difficult; the only real barrier is time ;-) I currently have a task at hand and enough pieces to accomplish it. I will circle back on this, though, as I get a little more comfortable with Go idioms and idiosyncrasies.
I wanted to share the above though as I view these quick results as promising.
P.S. I also tested using C to convert a record to a JSON *char and passing that to a Go function that unmarshals it into a Go struct. This worked fine but, as one would expect, adds a considerable amount of overhead: 12 minutes for the same ~51-second test noted above. It does work, though, as a quick approach.
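The Go side of that JSON bridge is just a standard unmarshal. A minimal sketch, where the Record shape is illustrative and the string literal stands in for the *char the C side produces (after C.GoString has copied it into a Go string):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is an illustrative Go view of one Avro record.
type Record struct {
	Name  string `json:"name"`
	Count int64  `json:"count"`
}

// fromJSON unmarshals the JSON text produced on the C side. Each record
// pays for a full serialize/parse round trip, which explains the
// overhead compared to reading values directly through the wrappers.
func fromJSON(s string) (Record, error) {
	var r Record
	err := json.Unmarshal([]byte(s), &r)
	return r, err
}

func main() {
	r, err := fromJSON(`{"name":"a","count":7}`)
	fmt.Println(r, err) // prints {a 7} <nil>
}
```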
On Mar 20, 2014 4:33 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote: