Avro, mail # user - Issue writing union in avro?


Re: Issue writing union in avro?
Jonathan Coveney 2013-04-07, 10:47
Thanks, that is very helpful. It actually makes complete sense (note the
other email where I was wondering exactly how Avro dealt with unions of
similar types). I guess what threw me off is that the Python implementation
worked fine.

Thanks again
Jon
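
For readers following the thread, here is a minimal sketch of the labeled union form discussed in the quoted messages below. It builds the JSON by hand and does not use the Avro library; the class and method names (UnionJsonSketch, wrapUnion, quote, userJson) are hypothetical, and it assumes a record whose "name" field is a plain string and whose "favorite_number" and "favorite_color" fields are unions with null, which the thread does not show explicitly.

```java
// Sketch only: hand-rolled string building, not Avro's own JSON encoder.
final class UnionJsonSketch {

    // Wraps an already-encoded JSON value in the union label.
    // branchName is the Avro type name ("int", "string", or a named type's name).
    // The null branch is written as a bare JSON null, with no wrapper object.
    static String wrapUnion(String branchName, String jsonValue) {
        if (branchName.equals("null")) {
            return "null";
        }
        return "{\"" + branchName + "\": " + jsonValue + "}";
    }

    // Quotes a string as JSON; escapes only backslash and double quote (sketch).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Assumed record layout: name is a string, the other two fields are unions.
    static String userJson(String name, Integer favoriteNumber, String favoriteColor) {
        String number = favoriteNumber == null
                ? wrapUnion("null", "null")
                : wrapUnion("int", favoriteNumber.toString());
        String color = favoriteColor == null
                ? wrapUnion("null", "null")
                : wrapUnion("string", quote(favoriteColor));
        return "{\"name\": " + quote(name)
                + ", \"favorite_number\": " + number
                + ", \"favorite_color\": " + color + "}";
    }

    public static void main(String[] args) {
        // Labeled branches, matching the string that worked in Java in this thread.
        System.out.println(userJson("Alyssa", 256, "blue"));
        // Null branches carry no label.
        System.out.println(userJson("Ben", null, null));
    }
}
```

The label is what lets a reader pick the union branch without guessing from the JSON type alone, which is the ambiguity Scott describes below.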
2013/4/7 Scott Carey <[EMAIL PROTECTED]>

> It is well documented in the specification:
> http://avro.apache.org/docs/current/spec.html#json_encoding
>
> I know others have overridden this behavior by extending GenericData
> and/or the JsonDecoder/Encoder.  The result wouldn't conform to the JSON
> encoding in the Avro specification, but you can extend Avro to do what you
> need it to.
>
> The reason for this encoding is to make sure that round-tripping data from
> binary to JSON and back results in the same data.  Additionally, unions can
> be more complicated and contain multiple records, each with a different name.
> Disambiguating the value requires more information, since several Avro data
> types map to the same JSON data type.  If the schema is a union of bytes
> and string, is "hello" a string or a bytes literal?  If it is a union of a
> map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with two
> string fields, or a map?  There are other approaches, and for some users
> perfect transmission of types is not critical.  Generally speaking, if you
> want to output Avro data as JSON and consume it as JSON, the extra data is not
> helpful.  If you want to read it back in as Avro, you're going to need the
> info to know which branch of the union to take.
>
> On 4/6/13 6:49 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>
> Err, it's the output format that deserializes the JSON and then writes it
> in the binary format, not the input format. But either way the general flow
> is the same.
>
> As a general aside, is the Java behavior the correct one, in that a union
> value should be written as {"string": "hello"} or similar? If that is a
> requirement, we should probably add it to the documentation.
>
>
> 2013/4/7 Jonathan Coveney <[EMAIL PROTECTED]>
>
>> Scott,
>>
>> Thanks for the input. The use case is that a number of our batch
>> processes are built on Python streaming. Currently, the reducer will output
>> a JSON string as a value, and then the input format will deserialize the
>> JSON and write it in binary format.
>>
>> Given that our use of Python streaming isn't going away, any suggestions
>> on how to make this better? Is there a better way to go from a JSON string
>> to writing binary Avro data?
>>
>> Thanks again
>> Jon
>>
>>
>> 2013/4/6 Scott Carey <[EMAIL PROTECTED]>
>>
>>> This is due to using the JSON encoding for Avro and not the binary
>>> encoding.  It would appear that the Python version is a little bit lax on
>>> the spec.  Some have built variations of the JSON encoding that do not
>>> label the union, but there are drawbacks to this too, as the type can be
>>> ambiguous in a very large number of cases without a label.
>>>
>>> Why are you using the JSON encoding for Avro?  The primary purpose of
>>> the JSON serialization form as it is now is to transform the binary into
>>> a human-readable form.
>>> Instead of building your GenericRecord from a JSON string, try using
>>> GenericRecordBuilder.
>>>
>>> -Scott
>>>
>>> On 4/5/13 4:59 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>>>
>>> Ok, I figured out the issue:
>>>
>>> If you make string c the following:
>>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, "
>>>     + "\"favorite_color\": {\"string\": \"blue\"}}";
>>>
>>> Then this works.
>>>
>>> This represents a divergence between the Python and the Java
>>> implementations: the above does not work in Python, but it does work in
>>> Java. And of course, vice versa for the unlabeled form.
>>>
>>> I think I know how to fix this (and can file a bug with my reproduction
>>> and the fix), but I'm not sure which behavior is expected. Which
>>> implementation is wrong?
>>>
>>> Thanks
>>>
>>>
>>> 2013/4/5 Jonathan Coveney <[EMAIL PROTECTED]>