Gregory 2013-05-22, 21:26
Re: Newb question on importing JSON and defaults


On 5/22/13 2:26 PM, "Gregory (Grisha) Trubetskoy" <[EMAIL PROTECTED]>
wrote:

>
>Hello!
>
>I have a test.json file that looks like this:
>
>{"first":"John", "last":"Doe", "middle":"C"}
>{"first":"John", "last":"Doe"}
>
>(Second line does NOT have a "middle" element).
>
>And I have a test.schema file that looks like this:
>
>{"name":"test",
>  "type":"record",
>  "fields": [
>     {"name":"first",  "type":"string"},
>     {"name":"middle", "type":"string", "default":""},
>     {"name":"last",   "type":"string"}
>]}
>
>I then try to use fromjson, as follows, and it chokes on the second line:
>
>$ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema test.json > test.avro
>Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
>        at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
>        at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>        at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
>        at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:219)
>        at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214)
>        at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>        at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>        at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:341)
>        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>        at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:105)
>        at org.apache.avro.tool.Main.run(Main.java:80)
>        at org.apache.avro.tool.Main.main(Main.java:69)
>
>
>The short story is - I need to convert a bunch of JSON where an element
>may not be present sometimes, in which case I'd want it to default to
>something sensible, e.g. blank or null.
>
>According to the Schema Resolution section of the spec, "if the reader's
>record schema has a field that contains a default value, and writer's
>schema does not have a field with the same name, then the reader should
>use the default value from its field."
>
>I'm clearly missing something obvious, any help would be appreciated!

There are two things that seem to be missing here:
 1. The fromjson tool uses the schema you provide as both the writer's
and the reader's schema, so no schema resolution happens, and Avro
expects every JSON fragment you give it to match that schema exactly.
 2. The tool does not work on arbitrary JSON; it expects JSON in the
format that the Avro JSON encoder writes. The two differ in a few ways,
primarily in how union types are disambiguated and how maps are told
apart from records (see the sketch after this list).
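
For example, one way to allow a missing "middle" at write time is to
make it a union with null (a sketch, untested against 1.7.4):

{"name":"test",
  "type":"record",
  "fields": [
     {"name":"first",  "type":"string"},
     {"name":"middle", "type":["null","string"], "default":null},
     {"name":"last",   "type":"string"}
]}

Note that fromjson still requires every field to be present, and the
Avro JSON encoding wraps non-null union values with their branch type,
so the input would have to look like:

{"first":"John", "last":"Doe", "middle":{"string":"C"}}
{"first":"John", "last":"Doe", "middle":null}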

To perform schema evolution while reading, you may need to separate the
JSON fragments missing "middle" from those that have it and run the tool
twice, with a corresponding schema for each case (a rough invocation is
sketched below). Alternatively, the tool itself
(tools/src/main/java/org/apache/avro/tool/DataFileWriteTool) could be
modified to handle schema resolution or to accept different JSON
encodings.
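
A rough sketch of the two-pass approach (the grep-based split and the
test-no-middle.schema file are hypothetical names; the latter would be
test.schema minus the "middle" field):

$ grep '"middle"' test.json > with-middle.json
$ grep -v '"middle"' test.json > no-middle.json
$ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema \
    with-middle.json > with-middle.avro
$ java -jar avro-tools-1.7.4.jar fromjson --schema-file \
    test-no-middle.schema no-middle.json > no-middle.avro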

Either way, you can avoid schema resolution at write time: separate the
records and write two files, one per schema. You can then handle schema
resolution in a later pass over the data with other tools (e.g. a data
file reader plus a writer), or resolve lazily at read time into the
schema you want.
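
For the read-time resolution, something like the following should work
(a minimal sketch against the 1.7 generic API; file names are the
hypothetical ones from above):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithDefaults {
  public static void main(String[] args) throws Exception {
    // Reader schema: the full record, with a default for "middle".
    Schema readerSchema = new Schema.Parser().parse(new File("test.schema"));

    // The writer schema comes from the data file itself; setExpected()
    // makes the datum reader resolve it against the reader schema, so
    // records written without "middle" come back with the default.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<GenericRecord>();
    datumReader.setExpected(readerSchema);

    DataFileReader<GenericRecord> fileReader =
        new DataFileReader<GenericRecord>(new File("no-middle.avro"), datumReader);
    try {
      for (GenericRecord record : fileReader) {
        System.out.println(record);
      }
    } finally {
      fileReader.close();
    }
  }
}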

>
>Grisha
>