Avro >> mail # user >> How is a union of multiple primitives handled?

Jonathan Coveney 2013-04-05, 16:11
Curt Hagenlocher 2013-04-05, 16:31
Jonathan Coveney 2013-04-06, 10:30
Re: How is a union of multiple primitives handled?
From reading threads about problems parsing Avro union types, and from
struggling to solve my own, it appears that this is a
common "problem" when using SpecificRecord and SpecificDatumReader. More
specifically, if you pass just a JSON string to the deserializer
(such as a SpecificDatumReader using a JsonDecoder), the parser
expects the types to be explicitly identified in the serialized string. I
presume this is done to preserve type safety when assigning the values read
during deserialization to the attribute fields of the object.
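To make the expected input concrete: in Avro's JSON encoding, a non-null union value is written as a one-entry object keyed by the branch type name, and null is written as a bare null. The following is a minimal sketch of a pre-processing step that tags plain values before they reach the decoder. The helper name `wrap_union_value` is hypothetical (it is not part of any Avro library), and it only handles unions of primitives.

```python
import json

# Hypothetical helper, not part of the Avro library. It tags a plain Python
# value with the name of the first union branch that the value matches.
_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, float),
    "double": lambda v: isinstance(v, float),
    "string": lambda v: isinstance(v, str),
}


def wrap_union_value(value, branch_types):
    """Return `value` in the tagged form used for unions in Avro JSON.

    `branch_types` is the list of primitive type names in the union,
    for example ["null", "string"].
    """
    for name in branch_types:
        if _CHECKS[name](value):
            # Null is encoded as a bare JSON null, everything else as
            # {"<type>": <data>}.
            return None if name == "null" else {name: value}
    raise ValueError("value %r matches no branch in %r" % (value, branch_types))


if __name__ == "__main__":
    plain = {"name": "a", "age": 3}
    # Assume the schema declares "age" as the union ["null", "int"].
    tagged = dict(plain, age=wrap_union_value(plain["age"], ["null", "int"]))
    print(json.dumps(tagged))
```

A step like this would have to know the schema's union fields in advance, so it is a workaround on the sender or gateway side rather than a change to how the decoder behaves.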

The most popular response on these threads is simply to suggest using
GenericDatumReader. However, it appears to me that if we did that, the
most important and useful functionality of Avro deserialization (and,
conversely, serialization) would be lost, because it would mean that
Avro can only be used if both sender (writer) and receiver (reader) are
using Avro, even for what appear to be format-agnostic encodings (such as
ASCII-text JSON, Protobuf, etc.).

Is there some kind of workaround so that we can continue to benefit from
immensely useful Avro library components like schema builders, encoders,
decoders and such, while having some flexibility in the formatting
of the inputs provided to them?


On Sat, Apr 6, 2013 at 6:30 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> Java has its own issues in this regard, which is that when deserializing a
> JSON String, if there is a union in the Schema then you have to give it
> {"<type>": <data>} which seems wrong to me (see the other email thread I
> started). I asked this question to understand how it should work in python,
> but also to get a sense of what the fix should be. I have made a patch that
> works according to my understanding, but I still am unsure if that
> understanding is correct, as well as if the Java treatment of unions in
> this case is correct (to me it seems needlessly cumbersome).
> Thanks for your help
> 2013/4/5 Curt Hagenlocher <[EMAIL PROTECTED]>
>> This is a Python-specific issue, and results from the interplay of two
>> implementation-specific features:
>> 1) Python ints, longs and floats can all legally be serialized as an Avro
>> double (or float). See io.py, line 118.
>> 2) The union serializer picks the first type that allows legal
>> serialization.
>> I would be surprised if you got the same thing in Java; it's not the kind
>> of behavior I would expect from a statically-typed language.
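The interplay of the two rules described above can be modeled in a few lines. This is a simplified sketch, not the code from io.py: the function names `is_valid`, `pick_branch` and `round_trip` are made up for illustration, and only the numeric primitives are covered.

```python
def is_valid(type_name, datum):
    """Rule 1 (simplified): Python ints and floats both validate as float/double."""
    if isinstance(datum, bool):
        return False  # keep the model to plain numbers
    if type_name in ("int", "long"):
        return isinstance(datum, int)
    if type_name in ("float", "double"):
        return isinstance(datum, (int, float))
    return False


def pick_branch(union_types, datum):
    """Rule 2: choose the first branch for which the datum validates."""
    for index, type_name in enumerate(union_types):
        if is_valid(type_name, datum):
            return index, type_name
    raise TypeError("datum %r fits no branch of %r" % (datum, union_types))


def round_trip(union_types, datum):
    """Model a write followed by a read: the value comes back as the branch type."""
    _, type_name = pick_branch(union_types, datum)
    return float(datum) if type_name in ("float", "double") else datum


if __name__ == "__main__":
    # In this model, the int 1 written to ["double", "int"] matches the
    # double branch first and is read back as a float.
    print(pick_branch(["double", "int"], 1))
    print(round_trip(["double", "int"], 1))
```

In this model the result depends on the order of the branches in the union, which is the kind of implementation-specific behavior the question is asking about.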
>> On Fri, Apr 5, 2013 at 9:11 AM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>>> The following gist illustrates my question:
>>> https://gist.github.com/jcoveney/5320422
>>> It seems pretty surprising to me that these cases all return 1.0,
>>> at least in Python (I will now do this in Java, it's just more verbose). Is
>>> this an issue with Python? Is this an issue, period? Is this unexpected?
>>> At the very least, if you write 1 to ["int", "double"] you'd expect that
>>> it'd get serialized as an int? Or is there a set of rules governing which
>>> primitive type to choose? Is it implementation dependent?
>>> Also, the case where it throws an error, then returns 0 seems completely
>>> wrong. Why would it do that at all? Is it that once it throws an error, it
>>> gets into an inconsistent state and nothing is guaranteed?
>>> Thanks for helping me understand this!
Pankaj Shroff