Avro user mailing list: Generating an avro schema with optional values

Guillaume Roger 2014-04-09, 14:00
Re: Generating an avro schema with optional values
Hi Guillaume,

Avro is primarily a binary serialization format. The JSON representation of data is mostly a convenience for debugging purposes, but isn't really the main purpose of Avro.
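
To make the binary-first point concrete, here is a rough Python sketch of how one record of the Event schema quoted below would be laid out in Avro's binary encoding, as I read the specification. It covers only the record body (not the data file header or sync markers), and the helper names zigzag_varint and encode_event are made up for this illustration. Field names never appear in the output; every field is written in schema order, and each union starts with the index of the chosen branch.

```python
def zigzag_varint(n):
    """Encode an integer the way Avro encodes a long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_event(valid, source):
    """Encode one Event record (hypothetical helper, not part of any Avro library).

    Both fields are ["null", T] unions, so each one is written as the
    zero-based branch index followed by the value for that branch.
    """
    out = bytearray()

    # valid: ["null", "boolean"]
    if valid is None:
        out += zigzag_varint(0)        # branch 0 = null, no payload
    else:
        out += zigzag_varint(1)        # branch 1 = boolean
        out.append(1 if valid else 0)  # boolean is a single byte

    # source: ["null", "bytes"]
    if source is None:
        out += zigzag_varint(0)        # branch 0 = null, no payload
    else:
        out += zigzag_varint(1)        # branch 1 = bytes
        out += zigzag_varint(len(source))  # length prefix
        out += source
    return bytes(out)


print(encode_event(True, b"live").hex())  # 020102086c697665
print(encode_event(True, None).hex())     # 020100
```

Even a null "source" still costs one byte for the branch index, so there is no way to write a record that simply skips the field.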

In particular, Avro schemas aren't expected to be able to describe arbitrary JSON documents. You can represent your missing "source" field like this:

{"valid": {"boolean": true}, "source": null}

...but a JSON document that's missing the "source" field entirely isn't expected to be valid.
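
If you would rather not write a serialiser, one possible workaround is to normalise the JSON before handing it to the tool. The following is only a sketch under my own assumptions: fill_missing_optionals is a hypothetical helper (not something avro-tools provides), SCHEMA is a copy of your so.avsc, and it handles only top-level fields whose type is a union containing "null".

```python
import json

# Copy of the so.avsc schema from the original message.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "valid",  "type": ["null", "boolean"], "default": null},
    {"name": "source", "type": ["null", "bytes"],   "default": null}
  ]
}
""")


def fill_missing_optionals(record, schema):
    """Return a copy of record with absent nullable fields set to None.

    Only top-level fields whose type is a union that includes "null"
    are filled in; everything else is left untouched.
    """
    filled = dict(record)
    for field in schema["fields"]:
        ftype = field["type"]
        is_nullable_union = isinstance(ftype, list) and "null" in ftype
        if field["name"] not in filled and is_nullable_union:
            filled[field["name"]] = None  # serialised as JSON null
    return filled


incomplete = json.loads('{"valid": {"boolean": true}}')
print(json.dumps(fill_missing_optionals(incomplete, SCHEMA)))
# {"valid": {"boolean": true}, "source": null}
```

The filled-in records could then be written to a new file and passed to fromjson in place of so.log.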

Martin

On 9 Apr 2014, at 06:59, Guillaume Roger <[EMAIL PROTECTED]> wrote:
Hi,

I am trying to write a very simple Avro schema (simple because it only isolates my current issue) to produce an Avro data file from data stored in JSON format. The trick is that one field is optional, and either avro-tools or I am not doing it right.

The goal is not to write my own serialiser; the end goal is to have this running in Flume, but I am still in the early stages.

The data (works), in a file named so.log:

{
  "valid":  {"boolean":true}
, "source": {"bytes":"live"}
}

The schema, in a file named so.avsc:

{
  "type":"record",
  "name":"Event",
  "fields":[
      {"name":"valid", "type": ["null", "boolean"],"default":null}
    , {"name":"source","type": ["null", "bytes"],"default":null}
  ]
}

I can easily generate an avro file with the following command:

java -jar avro-tools-1.7.6.jar fromjson --schema-file so.avsc so.log

So far so good. The thing is that "source" is optional, so I would expect the following data to be valid as well:

{
  "valid": {"boolean":true}
}

But running the same command gives me the error:

Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got END_OBJECT
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

I did try a lot of variations in the schema, even things that do not follow the avro spec. The schema I show here is, as far as I know, what the spec says it should be.

Would anybody know what I am doing wrong, and how I can actually have optional elements without writing my own serialiser?

Thanks,

Guillaume ROGER, Datawarehouse Engineer
Spil Games, http://www.spilgames.com
Arendstraat 23, 1223 RE Hilversum, The Netherlands

 