Avro >> mail # user >> json2avro

I wrote a little C tool using Avro-C to convert JSON to Avro and thought
someone here might find it useful.


The purpose is to convert messy legacy JSON in which some elements might
be missing or of the wrong type. Although there is no schema resolution
per se here, json2avro will use the default specified in the schema if
the corresponding JSON element is missing, and will try each type
specified in a union in turn until one succeeds.
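
To make the fallback behaviour concrete, here is a rough sketch of the
idea in plain C. It is not the json2avro source and does not use the
Avro-C or Jansson APIs: json_elem, avro_val, field_schema, try_branch and
resolve_field are hypothetical stand-ins invented for illustration, and
letting a JSON integer satisfy a double branch is an assumption of the
sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a parsed JSON element and an Avro value.
 * These are NOT Jansson or Avro-C types. */
typedef enum { J_MISSING, J_NULL, J_INT, J_REAL, J_STRING } json_kind;
typedef enum { A_NULL, A_LONG, A_DOUBLE, A_STRING } avro_kind;

typedef struct {
    json_kind kind;
    long i;
    double r;
    const char *s;
} json_elem;

typedef struct {
    avro_kind kind;
    long l;
    double d;
    const char *s;
} avro_val;

typedef struct {
    const avro_kind *branches; /* union branches, in schema order */
    size_t n_branches;
    const avro_val *dflt;      /* schema default, or NULL if none */
} field_schema;

/* Try to store `in` as one specific branch type.
 * Returns 0 on success, -1 if the types do not fit. */
static int try_branch(const json_elem *in, avro_kind want, avro_val *out)
{
    switch (want) {
    case A_NULL:
        if (in->kind != J_NULL) return -1;
        break;
    case A_LONG:
        if (in->kind != J_INT) return -1;
        out->l = in->i;
        break;
    case A_DOUBLE:
        /* Assumption: an integer is acceptable where a double is wanted. */
        if (in->kind == J_REAL) out->d = in->r;
        else if (in->kind == J_INT) out->d = (double)in->i;
        else return -1;
        break;
    case A_STRING:
        if (in->kind != J_STRING) return -1;
        out->s = in->s;
        break;
    default:
        return -1;
    }
    out->kind = want;
    return 0;
}

/* Missing element: fall back to the schema default, if there is one.
 * Present element: try each union branch in order until one succeeds.
 * Returns 0 on success, -1 if nothing fits. */
static int resolve_field(const field_schema *f, const json_elem *in,
                         avro_val *out)
{
    size_t k;

    assert(f != NULL && in != NULL && out != NULL);
    if (in->kind == J_MISSING) {
        if (f->dflt == NULL) return -1;
        *out = *f->dflt;
        return 0;
    }
    for (k = 0; k < f->n_branches; k++) {
        if (try_branch(in, f->branches[k], out) == 0) return 0;
    }
    return -1;
}
```

Because branches are tried in schema order, the order of a union matters
in this sketch: with ["long", "double"] an integer ends up as a long,
with ["double", "long"] it ends up as a double.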

json2avro lets you pick from the null, snappy, deflate and lzma codecs,
specify a custom block size, and optionally skip over JSON lines that it
is unable to parse. I'm also thinking of adding a target max file size so
that it would automatically split output into multiple files.
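
The skip-over-bad-lines behaviour could look roughly like the loop below.
Again this is only a sketch in plain C, not the actual json2avro code:
convert_lines, line_stats, MAX_LINE and the skip_bad flag are names made
up here, and looks_like_object is a trivial placeholder for the real work
of parsing a line with Jansson and appending a record with Avro-C.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 4096 /* assumed limit; longer lines count as bad */

/* Callback that converts one line; returns 0 on success. */
typedef int (*convert_fn)(const char *line, void *ctx);

typedef struct {
    long converted;
    long skipped;
} line_stats;

/* Placeholder converter: accepts anything shaped like "{...}".
 * A real tool would parse the JSON and write an Avro record here. */
static int looks_like_object(const char *line, void *ctx)
{
    size_t len = strlen(line);

    (void)ctx;
    return (len >= 2 && line[0] == '{' && line[len - 1] == '}') ? 0 : -1;
}

/* Feed `in` to `convert` one line at a time. With skip_bad set, lines
 * that fail are counted and skipped; otherwise the first failure stops
 * the run. Returns 0 on success, -1 on failure or read error. */
static int convert_lines(FILE *in, convert_fn convert, void *ctx,
                         int skip_bad, line_stats *st)
{
    char buf[MAX_LINE];

    assert(in != NULL && convert != NULL && st != NULL);
    st->converted = 0;
    st->skipped = 0;

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);
        int ok = 1;

        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = '\0'; /* strip the newline */
        } else if (len == sizeof buf - 1) {
            /* Buffer is full: check whether the line really continues. */
            int c = fgetc(in);
            if (c != EOF && c != '\n') {
                /* Too long: discard the remainder and mark it bad. */
                while ((c = fgetc(in)) != EOF && c != '\n') {
                }
                ok = 0;
            }
        }

        if (ok && convert(buf, ctx) != 0) ok = 0;

        if (ok) st->converted++;
        else if (skip_bad) st->skipped++;
        else return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Keeping a count of skipped lines makes it easy to report at the end how
much of the input was dropped.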

It uses Jansson as the JSON parser, which is conveniently bundled with
Avro-C. (One thing I'm not clear on: Jansson cannot handle nulls, and I'm
not sure whether this is a Jansson-specific limitation or something
inherent to JSON.)

This is rather simple code (no tests, and not even a "make install" yet)
and lacks support for some features, namely enums and aliases, but it's
good enough to be useful. It does seem pretty fast, slightly faster than
the avro-tools fromjson option (though my tests were hardly scientific).

Enjoy, and any feedback is very much appreciated!