Re: Converting arbitrary JSON to avro
On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <[EMAIL PROTECTED]> wrote:
> Json.Writer is indeed what I had in mind, and I have successfully managed to convert my existing JSON to Avro using it.
> However, using GenericDatumReader on this feels pretty unnatural, as I seem to be unable to access fields directly. It seems I have to access the "value" field on each record, which returns a Map that uses Utf8 objects as keys for the actual fields. Or am I doing something wrong here?
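
For concreteness, the access pattern you're describing looks roughly
like this with the generic API (a sketch against the 1.7-era classes;
the input stream "in" and the "name" key are made up):

    import java.util.Map;
    import org.apache.avro.data.Json;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.util.Utf8;

    // read a datum written with Json.Writer under the current Json.SCHEMA
    GenericDatumReader<Object> reader =
        new GenericDatumReader<Object>(Json.SCHEMA);
    Decoder decoder = DecoderFactory.get().binaryDecoder(in, null);
    GenericRecord wrapper = (GenericRecord) reader.read(null, decoder);

    // the top-level value is a record with a single "value" field,
    // and map keys come back as Utf8, not String
    Map<Utf8, Object> object = (Map<Utf8, Object>) wrapper.get("value");

    // nested values are wrapped in the same record, so every level
    // needs another get("value")
    GenericRecord nameWrapper = (GenericRecord) object.get(new Utf8("name"));
    Object name = nameWrapper.get("value");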

Hmm.  We could refactor Json.SCHEMA so that the union is the top-level
element.  That would get rid of the wrapper around the top-level
value.  It's a more redundant way to write the schema, but the binary
encoding is identical (since a record wrapper adds no bytes), so it
would require no changes to Json.Reader or Json.Writer.

[ "long",
  "double",
  "string",
  "boolean",
  "null",
  {"type" : "array",
   "items" : {
       "type" : "record",
       "name" : "org.apache.avro.data.Json",
       "fields" : [ {
           "name" : "value",
           "type" : [ "long", "double", "string", "boolean", "null",
                      {"type" : "array", "items" : "Json"},
                      {"type" : "map", "values" : "Json"}
                    ]
       } ]
   }
  },
  {"type" : "map", "values" : "Json"}
]

You can try this by placing this schema in
share/schemas/org/apache/avro/data/Json.avsc and rebuilding the Avro
jar.
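
With the union at the top level, the generic reader would hand back
the union branch itself instead of a wrapper record, roughly like this
(a sketch, using the same made-up input as above):

    // sketch: no wrapper record at the top level any more
    GenericDatumReader<Object> reader =
        new GenericDatumReader<Object>(Json.SCHEMA);
    Object datum =
        reader.read(null, DecoderFactory.get().binaryDecoder(in, null));
    if (datum instanceof Map) {  // a JSON object
      Map<Utf8, Object> object = (Map<Utf8, Object>) datum;
      // values inside nested arrays and maps are still wrapped records
    }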

Would such a change be useful to you?  If so, please file an issue in Jira.

Or we could even refactor this schema so that a Json object is the
top-level structure:

{"type" : "map",
 "values" : [ "long",
              "double",
              "string",
              "boolean",
              "null",
              {"type" : "array",
               "items" : {
                   "type" : "record",
                   "name" : "org.apache.avro.data.Json",
                   "fields" : [ {
                       "name" : "value",
                       "type" : [ "long", "double", "string", "boolean", "null",
                                  {"type" : "array", "items" : "Json"},
                                  {"type" : "map", "values" : "Json"}
                                ]
                   } ]
               }
              },
              {"type" : "map", "values" : "Json"}
            ]
}

This would change the binary format, but it would not change the
representation that GenericDatumReader hands you relative to my first
example above (since the generic representation unwraps unions).
Using this schema would require changes to Json.Writer and
Json.Reader.  It would better conform to the definition of Json, which
permits only objects as the top-level type.
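
With this schema the generic reader would always hand you the map
directly, along the lines of (a sketch; the "count" key is made up):

    // sketch: the top-level datum is always a map (a JSON object)
    Map<Utf8, Object> object = (Map<Utf8, Object>) reader.read(null, decoder);
    Object count = object.get(new Utf8("count"));  // primitives arrive unwrapped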

> Concerning the more specific schema, you are of course completely right. Unfortunately, more or less all of the fields in the JSON data format are optional, and many have substructures, so, at least in my understanding, I have to use unions of null and the actual type throughout the schema. I tried using JsonDecoder first (or rather the fromjson option of the avro tool, which I think uses JsonDecoder), but given the current JSON structures this didn't work.
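
For what it's worth, the optional-field pattern you describe would
look something like this (a made-up single-field record, parsed from a
string for brevity):

    import org.apache.avro.Schema;

    // hypothetical record with one optional field: a union of null and
    // the actual type, plus a null default so the field may be omitted
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Example\", \"fields\": ["
        + " {\"name\": \"title\", \"type\": [\"null\", \"string\"],"
        + "  \"default\": null} ]}");

Note that JsonDecoder expects Avro's JSON encoding, in which non-null
union values are tagged with their branch (e.g. {"title": {"string":
"foo"}} rather than {"title": "foo"}), which is why feeding it plain
JSON with optional fields fails.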

> So I'll probably have to look into implementing my own converter.  However, given the rather complex structure of the original JSON, I'm wondering whether trying to represent the data in Avro is such a good idea in the first place.

It would be interesting to see whether, with an appropriate schema,
the dataset is smaller and faster to process as Avro than as Json.  If
you have 1000 fields in your data but the typical record only has one
or two non-null, then an Avro record is perhaps not a good
representation.  An Avro map might be better, but if the values are
similarly variable then Json might be competitive.

Cheers,

Doug