Re: Converting arbitrary JSON to avro
Fwiw, I do this in web apps all the time via the python avro lib and json.dumps

Russell Jurney
twitter.com/rjurney
[EMAIL PROTECTED]
datasyndrome.com

On Sep 18, 2012, at 12:38 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <[EMAIL PROTECTED]> wrote:
>> Json.Writer is indeed what I had in mind, and I have successfully managed to convert my existing JSON to Avro using it.
>> However, using GenericDatumReader on this feels pretty unnatural, as I seem to be unable to access fields directly. It seems I have to access the "value" field on each record, which returns a Map that uses Utf8 objects as keys for the actual fields. Or am I doing something wrong here?
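[Editorial note: a minimal sketch of the access pattern described above, not code from the thread. It assumes an Avro 1.7-era classpath where org.apache.avro.data.Json exposes SCHEMA and the Json.Writer discussed here; the sample JSON document and the class name are invented for illustration.]

    import java.io.ByteArrayOutputStream;
    import java.util.Map;

    import org.apache.avro.data.Json;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.util.Utf8;
    import org.codehaus.jackson.map.ObjectMapper;

    public class JsonSchemaRoundTrip {
      public static void main(String[] args) throws Exception {
        // Write an arbitrary JSON document as binary Avro with Json.Writer.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new Json.Writer().write(
            new ObjectMapper().readTree("{\"name\": \"x\", \"count\": 1}"), enc);
        enc.flush();

        // Read it back with the generic API. The top-level datum is the
        // wrapper record org.apache.avro.data.Json, so the JSON object sits
        // under its "value" field, keyed by Utf8 instances...
        GenericDatumReader<Object> reader =
            new GenericDatumReader<Object>(Json.SCHEMA);
        GenericRecord wrapper = (GenericRecord) reader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));

        @SuppressWarnings("unchecked")
        Map<Utf8, Object> fields = (Map<Utf8, Object>) wrapper.get("value");

        // ...and each map value is itself a Json record that must be
        // unwrapped through its own "value" field, which is the pattern
        // Markus finds unnatural.
        GenericRecord name = (GenericRecord) fields.get(new Utf8("name"));
        System.out.println(name.get("value"));   // prints: x
      }
    }
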
>
> Hmm.  We could re-factor Json.SCHEMA so the union is the top-level
> element.  That would get rid of the wrapper around every value.  It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes).  It would hence
> require no changes to Json.Reader or Json.Writer.
>
> [ "long",
>  "double",
>  "string",
>  "boolean",
>  "null",
>  {"type" : "array",
>   "items" : {
>       "type" : "record",
>       "name" : "org.apache.avro.data.Json",
>       "fields" : [ {
>           "name" : "value",
>           "type" : [ "long", "double", "string", "boolean", "null",
>                      {"type" : "array", "items" : "Json"},
>                      {"type" : "map", "values" : "Json"}
>                    ]
>       } ]
>   }
>  },
>  {"type" : "map", "values" : "Json"}
> ]
>
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
>
> Would such a change be useful to you?  If so, please file an issue in Jira.
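[Editorial note: the point above that a record wrapper adds no bytes, and hence that this refactoring keeps the binary encoding identical, can be checked with the generic API. The sketch below is an illustration under simplified, made-up schemas (a trimmed ["null", "string"] union instead of the full Json union) and is not code from the thread.]

    import java.io.ByteArrayOutputStream;
    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class RecordWrapperAddsNoBytes {
      public static void main(String[] args) throws Exception {
        // A bare union, and the same union wrapped in a single-field record.
        Schema union = new Schema.Parser().parse("[\"null\", \"string\"]");
        Schema wrapped = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Wrapper\", \"fields\": ["
            + "{\"name\": \"value\", \"type\": [\"null\", \"string\"]}]}");

        // Encode a value directly against the bare union.
        ByteArrayOutputStream a = new ByteArrayOutputStream();
        BinaryEncoder ea = EncoderFactory.get().binaryEncoder(a, null);
        new GenericDatumWriter<Object>(union).write("hello", ea);
        ea.flush();

        // Encode the same value wrapped in the single-field record.
        GenericRecord rec = new GenericData.Record(wrapped);
        rec.put("value", "hello");
        ByteArrayOutputStream b = new ByteArrayOutputStream();
        BinaryEncoder eb = EncoderFactory.get().binaryEncoder(b, null);
        new GenericDatumWriter<Object>(wrapped).write(rec, eb);
        eb.flush();

        // The record contributes no bytes of its own, only its field's
        // union encoding, so the two byte sequences are identical.
        System.out.println(Arrays.equals(a.toByteArray(), b.toByteArray()));
      }
    }
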
>
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
>
> {"type" : "map",
> "values" : [ "long",
>              "double",
>              "string",
>              "boolean",
>              "null",
>              {"type" : "array",
>               "items" : {
>                   "type" : "record",
>                   "name" : "org.apache.avro.data.Json",
>                   "fields" : [ {
>                       "name" : "value",
>                       "type" : [ "long", "double", "string", "boolean", "null",
>                                  {"type" : "array", "items" : "Json"},
>                                  {"type" : "map", "values" : "Json"}
>                                ]
>                   } ]
>               }
>              },
>              {"type" : "map", "values" : "Json"}
>            ]
> }
>
> This would change the binary format but would not change the
> representation that GenericDatumReader would hand you from my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader.  It would better conform to the definition of Json, which
> only permits objects as the top-level type.
>
>> Concerning the more specific schema, you are of course completely right. Unfortunately more or less all the fields in the JSON data format are optional and many have substructures, so, at least in my understanding, I have to use unions of null and the actual type throughout the schema. I tried using JsonDecoder first (or rather the fromjson option of the avro tool, which, I think, uses JsonDecoder) but given the current JSON structures, this didn't work.
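[Editorial note: a small made-up schema illustrating the pattern described above; the record and field names are hypothetical. The trailing comment notes why plain JSON tends not to go through JsonDecoder / fromjson directly: Avro's JSON encoding labels union branches.]

    import org.apache.avro.Schema;

    public class OptionalFields {
      public static void main(String[] args) {
        // Each optional field becomes a union of "null" and the real type,
        // with a null default.
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Doc\", \"fields\": ["
            + "{\"name\": \"title\", \"type\": [\"null\", \"string\"], \"default\": null},"
            + "{\"name\": \"count\", \"type\": [\"null\", \"long\"], \"default\": null}]}");
        System.out.println(schema.toString(true));

        // Avro's JSON encoding labels union branches, e.g.
        //   {"title": {"string": "x"}, "count": null}
        // so plain JSON such as {"title": "x"} is not accepted by
        // JsonDecoder or the tool's fromjson option.
      }
    }
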
>
>> So I'll probably have to look into implementing my own converter. However, given the rather complex structure of the original JSON, I'm wondering whether trying to represent the data in Avro is such a good idea in the first place.
>
> It would be interesting to see whether, with the appropriate schema,
> the dataset is smaller and faster to process as Avro than as
> Json.  If you have 1000 fields in your data but the typical record
> only has one or two non-null, then an Avro record is perhaps not a
> good representation.  An Avro map might be better, but if the values