Re: Newb question on importing JSON and defaults

Thanks Scott!

So it looks like "fromjson" is mainly meant for processing JSON generated
by "tojson", not as a general JSON importing tool (although it could be
used as one); it's probably my short attention span, but somehow that
point got lost on me. (As I later learned, the schema that fromjson
accepts also seems to be a simplified one: specifying a union, for
example, gives an error.)
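
(If I read the spec's JSON Encoding section right, that is at least partly
because non-null union values have to be wrapped in a single-key object
naming the branch; with "middle" declared as ["null", "string"], fromjson
would expect

    {"first":"John", "last":"Doe", "middle":{"string":"C"}}

rather than a bare "middle":"C".)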

So if I expect to be dealing with data coming in as JSON and need to
convert it to Avro, is the current "best practice" to write a program of
your own? This seems like a fairly common thing to do; if there isn't a
general tool, perhaps this could be something useful to hack on for the
Avro project...
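
For illustration, here is a minimal sketch of such a program in Python,
assuming the "avro" package (whose schema-parsing function is spelled
parse or Parse depending on the version) and the file names from the
example below. It only fills in top-level defaults, so nested records and
unions would need more work:

    import json
    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # Read the schema as plain JSON first so we can pull out field defaults.
    with open("test.schema") as f:
        schema_json = json.load(f)
    defaults = {fld["name"]: fld["default"]
                for fld in schema_json["fields"] if "default" in fld}

    schema = avro.schema.parse(json.dumps(schema_json))  # Parse() in some versions

    # Copy each JSON record into an Avro data file, applying the schema
    # default for any field the input line leaves out.
    writer = DataFileWriter(open("test.avro", "wb"), DatumWriter(), schema)
    for line in open("test.json"):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for name, value in defaults.items():
            record.setdefault(name, value)
        writer.append(record)
    writer.close()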

Grisha

On Thu, 23 May 2013, Scott Carey wrote:

>
>
> On 5/22/13 2:26 PM, "Gregory (Grisha) Trubetskoy" <[EMAIL PROTECTED]>
> wrote:
>
>>
>> Hello!
>>
>> I have a test.json file that looks like this:
>>
>> {"first":"John", "last":"Doe", "middle":"C"}
>> {"first":"John", "last":"Doe"}
>>
>> (Second line does NOT have a "middle" element).
>>
>> And I have a test.schema file that looks like this:
>>
>> {"name":"test",
>>  "type":"record",
>>  "fields": [
>>     {"name":"first",  "type":"string"},
>>     {"name":"middle", "type":"string", "default":""},
>>     {"name":"last",   "type":"string"}
>> ]}
>>
>> I then try to use fromjson, as follows, and it chokes on the second line:
>>
>> $ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema test.json > test.avro
>> Exception in thread "main" org.apache.avro.AvroTypeException: Expected field name not found: middle
>>         at org.apache.avro.io.JsonDecoder.doAction(JsonDecoder.java:477)
>>         at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>         at org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
>>         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:219)
>>         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214)
>>         at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
>>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:348)
>>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:341)
>>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
>>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>         at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:105)
>>         at org.apache.avro.tool.Main.run(Main.java:80)
>>         at org.apache.avro.tool.Main.main(Main.java:69)
>>
>>
>> The short story is: I need to convert a bunch of JSON where an element
>> may sometimes be absent, in which case I'd want it to default to
>> something sensible, e.g. blank or null.
>>
>> According to the Schema Resolution section of the spec, "if the reader's
>> record schema has a field that contains a default value, and writer's
>> schema does not have a field with the same name, then the reader should
>> use the default value from its field."
>>
>> I'm clearly missing something obvious, any help would be appreciated!
>
> There are two things that seem to be missing here:
> 1. The fromjson tool sets both the writer's schema and the reader's schema
> to the one you provided.  Avro therefore expects every JSON fragment you
> give it to have the same schema.
> 2. The tool will not work for arbitrary JSON; it expects JSON in the
> format that the Avro JSON encoder writes.  There are a few differences
> in expectations, primarily around disambiguating union types and telling
> maps apart from records.
>
> To perform schema evolution while reading, you may need to separate the
> JSON fragments missing "middle" from those that have it, and run the tool
> on each with a matching schema.
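
For example, as a sketch only (test-no-middle.schema is a hypothetical
copy of the schema with the "middle" field removed):

    $ grep -v '"middle"' test.json > no-middle.json
    $ grep '"middle"' test.json > with-middle.json
    $ java -jar avro-tools-1.7.4.jar fromjson --schema-file test-no-middle.schema no-middle.json > no-middle.avro
    $ java -jar avro-tools-1.7.4.jar fromjson --schema-file test.schema with-middle.json > with-middle.avro

Reading no-middle.avro with the full schema as the reader's schema would
then supply "middle" from its default, per the schema-resolution rule
quoted above.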