Avro, mail # user - Converting arbitrary JSON to avro

Markus Strickler 2012-09-17, 16:40
Doug Cutting 2012-09-17, 17:35
Re: Converting arbitrary JSON to avro
Markus Strickler 2012-09-18, 18:34
Hi Doug,

Thank you for your detailed explanation.

Json.Writer is indeed what I had in mind, and I have managed to convert my existing JSON to Avro using it.
However, using GenericDatumReader on this data feels pretty unnatural, as I seem unable to access fields directly. It seems I have to read the "value" field of each record, which returns a Map that uses Utf8 objects as keys for the actual fields. Or am I doing something wrong here?

Concerning the more specific schema, you are of course completely right. Unfortunately, nearly all fields in the JSON data format are optional and many have substructures, so, at least as I understand it, I have to use unions of null and the actual type throughout the schema. I first tried JsonDecoder (or rather the fromjson option of the avro tool, which I think uses JsonDecoder), but given the current JSON structures, this didn't work.
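
To illustrate why plain JSON with optional fields does not pass through JsonDecoder unchanged: in Avro's JSON encoding, a non-null union value is wrapped in a single-entry object keyed by the branch's type name, while null stays a bare null. The following is a minimal sketch of that wrapping using only java.util collections. The class UnionTagging, its methods, and the flat field-name-to-type-name map are hypothetical and not part of Avro; writing absent fields as an explicit null assumes the decoder expects every field to be present.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: not an Avro class. */
class UnionTagging {

    /**
     * Wraps a value the way Avro's JSON encoding represents a union of
     * null and one other type: null stays null, anything else becomes
     * a single-entry object such as {"int": 1}.
     */
    static Object tagOptional(String typeName, Object value) {
        if (value == null) {
            return null;
        }
        Map<String, Object> tagged = new LinkedHashMap<>();
        tagged.put(typeName, value);
        return tagged;
    }

    /**
     * Tags every field of a flat JSON object.
     * fieldTypes maps each field name to the Avro type name of its
     * non-null branch (a hypothetical stand-in for a real schema).
     * Fields missing from the input are written as explicit nulls.
     */
    static Map<String, Object> tagRecord(Map<String, String> fieldTypes,
                                         Map<String, Object> plain) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fieldTypes.entrySet()) {
            Object value = plain.get(field.getKey());
            out.put(field.getKey(), tagOptional(field.getValue(), value));
        }
        return out;
    }
}
```

With fieldTypes of age -> "int" and nick -> "string", an input containing only age = 1 becomes an object whose age is the single-entry map {"int": 1} and whose nick is null. Nested substructures would need the same treatment applied recursively.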

So I'll probably have to look into implementing my own converter. However, given the rather complex structure of the original JSON, I'm wondering whether trying to represent the data in Avro is such a good idea in the first place.

Again, thanks a lot for your help,
On 17.09.2012 at 19:35, Doug Cutting wrote:

> On Mon, Sep 17, 2012 at 9:40 AM, Markus Strickler <[EMAIL PROTECTED]> wrote:
>> I'm currently trying to convert already existing JSON (not generated by avro) to avro and am wondering if there is some generic way to do this (maybe an avro schema that matches arbitrary JSON)?
>
> Yes, there is support for reading and writing arbitrary Json data as Avro:
>
> http://avro.apache.org/docs/current/api/java/org/apache/avro/data/Json.html
>
> Json.Writer will take Json data that's been parsed into Jackson's
> JsonNode representation and write it as Avro data using the schema
> Json.SCHEMA, and Json.Reader will read Avro data written with this
> Schema into a JsonNode. Note that just because you wrote the data
> with Json.Writer doesn't mean you need to read it with Json.Reader.
> You could instead read it with GenericDatumReader, from MapReduce or
> Hive.
>
> However using a more-specific schema than Json.SCHEMA will result in a
> smaller and faster Avro encoding for your data. It's also likely to
> result in a schema that much better describes your data for use in
> Pig, Hive, etc.
>
> If all of your records are of the same schema, and that schema doesn't
> have unions (i.e., a given field always has values of the same type,
> all objects have the same set of fields, fully populated) then you may
> be able to use Avro's JsonDecoder. Note however that Avro's
> JsonEncoder/JsonDecoder are not generally appropriate for arbitrary
> Json, but rather are intended to represent Avro data as Json. (Unions
> are the biggest difference. Avro's Json encoding uses a Json object
> to tag each union value with the intended type. For example, an Avro
> union of a string and an int which has an int value of 1 would be
> encoded in Json as {"int":1}.)
>
> For a given schema it is simple to write a short Java program that
> converts from Json to Avro. A general tool for such conversions
> doesn't yet exist but would make a great addition to Avro (if anyone's
> looking for a way to contribute). The core of this might be a method
> that walks a JsonNode and a Schema in parallel, returning an object in
> Avro's generic representation.
>
> Doug
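
The parallel walk suggested in the quoted message can be sketched without any Avro or Jackson dependency. This is only a toy model under stated assumptions: parsed JSON is represented as plain Java values (Map, List, String, Number, Boolean, null) in place of Jackson's JsonNode, and the hypothetical SchemaWalker.Type class stands in for Avro's Schema. A real converter would walk org.apache.avro.Schema and return objects in Avro's generic representation instead. Unions are resolved here by taking the first branch that matches, which is one possible policy and not necessarily what Avro itself would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of walking a JSON value and a schema in parallel. Not Avro code. */
class SchemaWalker {

    enum Kind { NULL, BOOLEAN, LONG, DOUBLE, STRING, ARRAY, RECORD, UNION }

    /** Hypothetical, heavily simplified stand-in for an Avro schema. */
    static final class Type {
        final Kind kind;
        final Type items;               // element type, ARRAY only
        final Map<String, Type> fields; // field types, RECORD only
        final List<Type> branches;      // alternatives, UNION only

        private Type(Kind kind, Type items, Map<String, Type> fields, List<Type> branches) {
            this.kind = kind;
            this.items = items;
            this.fields = fields;
            this.branches = branches;
        }

        static Type of(Kind kind) { return new Type(kind, null, null, null); }
        static Type array(Type items) { return new Type(Kind.ARRAY, items, null, null); }
        static Type record(Map<String, Type> fields) { return new Type(Kind.RECORD, null, fields, null); }
        static Type union(Type... branches) { return new Type(Kind.UNION, null, null, Arrays.asList(branches)); }
    }

    /** Converts a parsed JSON value to the shape the type describes, or throws if it does not fit. */
    static Object convert(Object json, Type type) {
        switch (type.kind) {
            case NULL:
                if (json == null) return null;
                break;
            case BOOLEAN:
                if (json instanceof Boolean) return json;
                break;
            case LONG:
                if (json instanceof Integer || json instanceof Long) return ((Number) json).longValue();
                break;
            case DOUBLE:
                if (json instanceof Number) return ((Number) json).doubleValue();
                break;
            case STRING:
                if (json instanceof String) return json;
                break;
            case ARRAY:
                if (json instanceof List) {
                    List<Object> out = new ArrayList<>();
                    for (Object item : (List<?>) json) {
                        out.add(convert(item, type.items));
                    }
                    return out;
                }
                break;
            case RECORD:
                if (json instanceof Map) {
                    Map<?, ?> in = (Map<?, ?>) json;
                    Map<String, Object> out = new LinkedHashMap<>();
                    // Walk the schema's fields; a field absent from the JSON is treated as null.
                    for (Map.Entry<String, Type> field : type.fields.entrySet()) {
                        out.put(field.getKey(), convert(in.get(field.getKey()), field.getValue()));
                    }
                    return out;
                }
                break;
            case UNION:
                // First matching branch wins (an assumption of this sketch).
                for (Type branch : type.branches) {
                    try {
                        return convert(json, branch);
                    } catch (IllegalArgumentException e) {
                        // This branch does not fit; try the next one.
                    }
                }
                break;
        }
        throw new IllegalArgumentException("value " + json + " does not match " + type.kind);
    }
}
```

Because a missing field is passed down as null, an optional field declared as a union of NULL and STRING accepts JSON objects that omit it, while a required STRING field causes an IllegalArgumentException. That matches the situation described in the reply, where nearly every field is optional and the input JSON carries no union tags.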
Doug Cutting 2012-09-18, 19:38
Markus Strickler 2012-09-19, 15:44
Russell Jurney 2012-09-18, 23:18
Markus Strickler 2012-09-19, 15:54