Avro >> mail # user >> Newb question on importing JSON and defaults


Re: Newb question on importing JSON and defaults

Thanks Scott!

So it looks like "fromjson" is mainly meant for processing JSON generated
by "tojson", not as a general JSON importing tool (although it can be used
as one) - it's probably my short attention span, but somehow that point
was lost on me. (As I later learned, the schema that fromjson expects is
also a simplified one - e.g. specifying a union gives an error.)

So if I expect to be dealing with data coming in as JSON and need to
convert it to Avro - the current "best practice" is to write a program of
my own? This seems like a fairly common thing to do; if there isn't a
general tool, perhaps this could be something useful to hack on for the
Avro project...
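A minimal sketch of what such a converter's first step could look like, in Python (hypothetical, not part of avro-tools): it pre-fills missing fields with the "default" values declared in the record schema, so the normalized JSON lines can then be fed to fromjson.

```python
import json

# Hypothetical helper: fill in missing record fields using the
# "default" values declared in an Avro record schema, so that
# plain JSON lines can be fed to avro-tools fromjson afterwards.
def fill_defaults(record, schema):
    for field in schema["fields"]:
        if field["name"] not in record:
            if "default" not in field:
                raise ValueError("missing field with no default: %s" % field["name"])
            record[field["name"]] = field["default"]
    return record

# The schema from this thread.
schema = json.loads("""
{"name": "test",
 "type": "record",
 "fields": [
    {"name": "first",  "type": "string"},
    {"name": "middle", "type": "string", "default": ""},
    {"name": "last",   "type": "string"}
]}
""")

# The second (problematic) line from test.json, missing "middle".
line = '{"first": "John", "last": "Doe"}'
print(json.dumps(fill_defaults(json.loads(line), schema), sort_keys=True))
# -> {"first": "John", "last": "Doe", "middle": ""}
```

Since every output line then carries every field, fromjson's requirement that each fragment match the single provided schema is satisfied.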

Grisha

On Thu, 23 May 2013, Scott Carey wrote:

>
>
> On 5/22/13 2:26 PM, "Gregory (Grisha) Trubetskoy" <[EMAIL PROTECTED]>
> wrote:
>
>>
>> Hello!
>>
>> I have a test.json file that looks like this:
>>
>> {"first":"John", "last":"Doe", "middle":"C"}
>> {"first":"John", "last":"Doe"}
>>
>> (Second line does NOT have a "middle" element).
>>
>> And I have a test.schema file that looks like this:
>>
>> {"name":"test",
>>  "type":"record",
>>  "fields": [
>>     {"name":"first",  "type":"string"},
>>     {"name":"middle", "type":"string", "default":""},
>>     {"name":"last",   "type":"string"}
>> ]}
>>
>> I then try to use fromjson, as follows, and it chokes on the second line:
>>
>> $ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema
>> test.json > test.avro
>> Exception in thread "main" org.apache.avro.AvroTypeException: Expected
>> field name not found: middle
>>         at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
>>         at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>         at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
>>         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:219)
>>         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214)
>>         at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:341)
>>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>         at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:105)
>>         at org.apache.avro.tool.Main.run(Main.java:80)
>>         at org.apache.avro.tool.Main.main(Main.java:69)
>>
>>
>> The short story is - I need to convert a bunch of JSON where an element
>> may not be present sometimes, in which case I'd want it to default to
>> something sensible, e.g. blank or null.
>>
>> According to the Schema Resolution "if the reader's record schema has a
>> field that contains a default value, and writer's schema does not have a
>> field with the same name, then the reader should use the default value
>> from its field."
>>
>> I'm clearly missing something obvious, any help would be appreciated!
>
> There are two things that seem to be missing here:
> 1. The fromjson tool configures both the writer's schema and the reader's
> schema to the one you provided.  Avro is expecting every JSON fragment
> you give it to have that same schema.
> 2. The tool will not work for all arbitrary JSON; it expects JSON in the
> format that the Avro JSON Encoder writes.  There are a few differences
> in expectations, primarily when disambiguating union types, and maps from
> records.
>
> To perform schema evolution while reading, you may need to separate json
> fragments missing "middle" from those that have it, and run the tool
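To illustrate the second point above: as far as I understand, the Avro JSON encoding wraps a non-null union value in a single-key object naming the chosen branch, which plain JSON never does. A hypothetical sketch of that rewrite for a field declared as the union ["null", "string"]:

```python
import json

# Avro's JSON encoding disambiguates unions by wrapping non-null
# values in an object keyed by the branch's type name: "C" becomes
# {"string": "C"} for a ["null", "string"] union, while null stays
# a bare null. This hypothetical helper rewrites a plain JSON value
# into that shape.
def wrap_union(value):
    if value is None:
        return None           # null branch: encoded as bare null
    return {"string": value}  # string branch: keyed by type name

record = {"first": "John", "middle": "C", "last": "Doe"}
record["middle"] = wrap_union(record.get("middle"))
print(json.dumps(record, sort_keys=True))
# -> {"first": "John", "last": "Doe", "middle": {"string": "C"}}
```

This is why feeding ordinary JSON with union fields to fromjson errors out: the tool sees a plain string where the Avro JSON encoding requires the wrapped form.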