Re: Converting arbitrary JSON to avro
Hi Doug,

thanks for the suggestion; I wasn't aware that one could specify anything other than a record as the top-level element of a schema.
I tried this, and it works well for flat data, but with nested structures you still seem to have to go through the "value" indirection.
Also, I think this change might break existing code that relies on the current structure.
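For illustration, that indirection looks roughly like this with GenericDatumReader (an untested sketch; "title" is a made-up field name, and datum is the decoded top-level object):

  GenericRecord root = (GenericRecord) datum;                     // every JSON value is wrapped in a Json record
  Map<Utf8, Object> obj = (Map<Utf8, Object>) root.get("value");  // a JSON object comes back as a Utf8-keyed map
  GenericRecord wrapped = (GenericRecord) obj.get(new Utf8("title"));
  Object title = wrapped.get("value");                            // and each entry needs unwrapping again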

As far as size and performance go, I'll probably have to run some tests on real data once I've come up with an appropriate schema that actually matches it.

Again, thanks a lot for your help.

Best,
-markus

On 18.09.2012, at 21:38, Doug Cutting wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <[EMAIL PROTECTED]> wrote:
>> Json.Writer is indeed what I had in mind, and I have successfully managed to convert my existing JSON to avro using it.
>> However, using GenericDatumReader on this feels pretty unnatural, as I seem to be unable to access fields directly. It seems I have to access the "value" field on each record, which returns a Map that uses Utf8 objects as keys for the actual fields. Or am I doing something wrong here?
>
> Hmm.  We could refactor Json.SCHEMA so the union is the top-level
> element.  That would get rid of the wrapper around every value.  It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes).  It would hence
> require no changes to Json.Reader or Json.Writer.
>
> [ "long",
>  "double",
>  "string",
>  "boolean",
>  "null",
>  {"type" : "array",
>   "items" : {
>       "type" : "record",
>       "name" : "org.apache.avro.data.Json",
>       "fields" : [ {
>           "name" : "value",
>           "type" : [ "long", "double", "string", "boolean", "null",
>                      {"type" : "array", "items" : "Json"},
>                      {"type" : "map", "values" : "Json"}
>                    ]
>       } ]
>   }
>  },
>  {"type" : "map", "values" : "Json"}
> ]
>
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
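>
> For example, reading data written with Json.Writer would then look
> something like this (an untested sketch; bytes holds the binary-encoded
> data):
>
>   Schema schema = new Schema.Parser().parse(new File("Json.avsc"));
>   GenericDatumReader<Object> reader = new GenericDatumReader<Object>(schema);
>   Object datum = reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
>   // since the generic representation unwraps unions, datum is directly a
>   // Long, Double, Utf8, Boolean, null, List or Map -- no "value" record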
>
> Would such a change be useful to you?  If so, please file an issue in Jira.
>
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
>
> {"type" : "map",
> "values" : [ "long",
>              "double",
>              "string",
>              "boolean",
>              "null",
>              {"type" : "array",
>               "items" : {
>                   "type" : "record",
>                   "name" : "org.apache.avro.data.Json",
>                   "fields" : [ {
>                       "name" : "value",
>                       "type" : [ "long", "double", "string", "boolean", "null",
>                                  {"type" : "array", "items" : "Json"},
>                                  {"type" : "map", "values" : "Json"}
>                                ]
>                   } ]
>               }
>              },
>              {"type" : "map", "values" : "Json"}
>            ]
> }
>
> This would change the binary format but would not change the
> representation that GenericDatumReader would hand you from my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader.  It would better conform to the definition of Json, which
> only permits objects as the top-level type.
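>
> With the map at the top level, the datum handed back would simply be the
> map itself, e.g. (untested; "count" is a hypothetical key):
>
>   Map<Utf8, Object> root = (Map<Utf8, Object>) reader.read(null, in);
>   Object count = root.get(new Utf8("count"));   // scalar values arrive unwrapped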
>
>> Concerning the more specific schema, you are of course completely right. Unfortunately, more or less all of the fields in the JSON data format are optional, and many have substructures, so, at least in my understanding, I have to use unions of null and the actual type throughout the schema, as in the snippet below. I tried using JsonDecoder first (or rather the fromjson option of the avro tool, which, I think, uses JsonDecoder), but given the current JSON structures this didn't work.
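>> For instance, every optional field ends up as something like this (with "title" standing in for one of my fields):
>>
>>   {"name" : "title", "type" : ["null", "string"], "default" : null}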
>
>> So I'll probably have to look into implementing my own converter. However, given the rather complex structure of the original JSON, I'm wondering whether trying to represent the data in avro is such a good idea in the first place.