Avro >> mail # user >> Issue writing union in avro?

Jonathan Coveney 2013-04-05, 09:24
Jonathan Coveney 2013-04-05, 11:15
Jonathan Coveney 2013-04-05, 11:59
Scott Carey 2013-04-06, 20:36
Jonathan Coveney 2013-04-07, 01:42
Jonathan Coveney 2013-04-07, 01:49
Scott Carey 2013-04-07, 07:16
Jonathan Coveney 2013-04-07, 10:47
Re: Issue writing union in avro?
I will open a JIRA ticket to request a Python StrictJSONEncoder that
produces these type-hints. Probably a StrictJSONDecoder needs to be there
too -- at any rate, the StrictJSONDecoder would be nice so that Python
could consume JSON-encoded output from Java et al.

A StrictJSON{Decoder,Encoder} might provide a (high-IO) workaround to
Jeremy Karn's problem about how to consume avro over a non-seekable
filehandle (e.g., standard in).
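For illustration, a minimal sketch of what such a StrictJSONEncoder might emit for union values. The function name and shape here are hypothetical, not part of the avro package:

```python
import json

# Hypothetical sketch: given the writer's chosen union branch, emit the
# spec-style type hint by wrapping the value in a single-key object.
# Per the Avro spec's JSON encoding, the 'null' branch is bare JSON null.
def encode_union_branch(branch_name, value):
    if branch_name == "null":
        return "null"
    return json.dumps({branch_name: value})

print(encode_union_branch("long", 5))     # {"long": 5}
print(encode_union_branch("null", None))  # null
```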

As I understand it:

The Python avro library doesn't have a JSON encoder at all: it has a binary
decoder, which deserializes to Python generics. These generics conveniently
serialize to JSON using the json.dumps core library call, but **json.dumps
on a python object is NOT the same as json encoding Avro**.

There are actually two slightly different understandings of "encoded in
JSON" built into the discussion around Python:

  (a) json.dumps(obj) on the Python generic

  (b) "strict" json encoding, which would require knowing when the schema
expects a union and inserting the extra key-name type hint.

(b) is required to preserve type information reliably in JSON, but type
information for union members may *always* be lost in a round trip through
Python generics. If something is encoded as a 'long' when the schema reads
['int', 'long'], the Python code does not guarantee that an
avro>python>avro round trip will re-encode it as 'long'.
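The difference between (a) and (b) can be sketched with plain json.dumps (no avro library calls; the branch names follow the spec's JSON encoding):

```python
import json

# (a) json.dumps on the Python generic: no union type information.
print(json.dumps(5))            # 5

# (b) "strict" spec encoding for a union ['int', 'long'] records the
# branch the writer chose:
print(json.dumps({"long": 5}))  # {"long": 5}

# After decoding to a Python generic, both branches are just int, so a
# re-encode cannot know the value was originally written as 'long' -- a
# naive writer picks the first matching branch, 'int'.
decoded = 5
```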

On Apr 7, 2013 3:47 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:

> Thanks, that is very helpful. It actually makes complete sense (note the
> other email where I was wondering exactly how avro dealt with unions of
> similar types). I guess what threw me off is that the python implementation
> worked fine.
> Thanks again
> Jon
> 2013/4/7 Scott Carey <[EMAIL PROTECTED]>
>> It is well documented in the specification:
>> http://avro.apache.org/docs/current/spec.html#json_encoding
>> I know others have overridden this behavior by extending GenericData
>> and/or the JsonDecoder/Encoder.  It wouldn't conform to the Avro
>> Specification's JSON encoding, but you can extend Avro to do what you need it to.
>> The reason for this encoding is to make sure that round-tripping data
>> from binary to json and back results in the same data.  Additionally,
>> unions can be more complicated and contain multiple records each with
>> different names.  Disambiguating the value requires more information since
>> several Avro data types map to the same JSON data type.  If the schema is a
>> union of bytes and string, is "hello" a string, or byte literal?  If it is
>> a union of a map and a record, is {"state":"CA", "city":"Pittsburgh"}  a
>> record with two string fields, or a map?   There are other approaches, and
>> for some users perfect transmission of types is not critical.  Generally
>> speaking, if you want to output Avro data as JSON and consume as JSON, the
>> extra data is not helpful.  If you want to read it back in as Avro, you're
>> going to need the info to know which branch of the union to take.
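The map-versus-record ambiguity above can be made concrete with plain JSON. The record name "Address" below is an assumption for illustration; per the spec, a record branch is named by the record's full name and a map branch by "map":

```python
import json

payload = {"state": "CA", "city": "Pittsburgh"}

# Bare JSON: could be a map<string> or a record with two string fields.
print(json.dumps(payload))

# The spec's union encoding names the branch, removing the ambiguity:
print(json.dumps({"map": payload}))      # the map branch
print(json.dumps({"Address": payload}))  # the record branch (full name)
```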
>> On 4/6/13 6:49 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>> Err, it's the output format that deserializes the json and then writes it
>> in the binary format, not the input format. But either way the general flow
>> is the same.
>> As a general aside, is it the case that the Java behavior is correct, in
>> that when writing a union it should be {"string": "hello"} or whatnot? Seems
>> like we should probably add that to the documentation if it is a
>> requirement.
>> 2013/4/7 Jonathan Coveney <[EMAIL PROTECTED]>
>>> Scott,
>>> Thanks for the input. The use case is that a number of our batch
>>> processes are built on python streaming. Currently, the reducer will output
>>> a json string as a value, and then the input format will deserialize the
>>> json, and then write it in binary format.
>>> Given that our use of python streaming isn't going away, any suggestions
>>> on how to make this better? Is there a better way to go from json string ->