Avro >> mail # user >> Issue writing union in avro?


Re: Issue writing union in avro?
I will open a JIRA ticket to request a Python StrictJSONEncoder that
produces these type-hints. Probably a StrictJSONDecoder needs to be there
too -- at any rate, the StrictJSONDecoder would be nice so that Python
could consume JSON-encoded output from Java et al.

A StrictJSON{Decoder,Encoder} might provide a (high-IO) workaround to
Jeremy Karn's problem about how to consume avro over a non-seekable
filehandle (e.g., standard in).

As I understand it:

The Python avro library doesn't have a JSON encoder at all: it has a binary
decoder, which deserializes to Python generics. These generics conveniently
serialize to JSON using the json.dumps core library call, but **json.dumps
on a python object is NOT the same as json encoding Avro**.

There are actually two slightly different understandings of "encoded in
JSON" built into the discussion around Python:

  (a) json.dumps(obj) on the Python generic

  (b) "strict" json encoding, which would require knowing when the schema
expects a union and inserting the extra key-name type hint.
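
A minimal sketch of the difference between (a) and (b), under stated assumptions: `strict_encode` and `_PRIMITIVE_CHECKS` are hypothetical helpers written for illustration, not part of the Python avro library, and the sketch only handles records and unions of primitive types. The schema is passed as plain parsed JSON.

```python
import json

# Hypothetical table (not an avro API): Avro primitive name -> check on the
# Python value. Note that 'int' and 'long' both accept any Python int.
_PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "double": lambda v: isinstance(v, float),
    "string": lambda v: isinstance(v, str),
}


def strict_encode(schema, datum):
    """Return a JSON-ready structure with union type hints added."""
    if isinstance(schema, list):  # a union is a JSON array of branch schemas
        for branch in schema:
            # Assumption: only primitive branches are supported in this sketch.
            if isinstance(branch, str) and _PRIMITIVE_CHECKS[branch](datum):
                # null is written bare; any other branch is wrapped in a
                # single-key object naming the branch type.
                return None if branch == "null" else {branch: datum}
        raise ValueError("datum matches no branch of %r" % (schema,))
    if isinstance(schema, dict) and schema["type"] == "record":
        return {
            field["name"]: strict_encode(field["type"], datum[field["name"]])
            for field in schema["fields"]
        }
    return datum  # non-union primitive: no hint needed


schema = {
    "type": "record",
    "name": "Greeting",
    "fields": [{"name": "msg", "type": ["null", "string"]}],
}
record = {"msg": "hello"}

plain = json.dumps(record)                          # (a)
strict = json.dumps(strict_encode(schema, record))  # (b)
print(plain)   # {"msg": "hello"}
print(strict)  # {"msg": {"string": "hello"}}
```

The only difference is the `{"string": ...}` wrapper, which is what a reader needs in order to pick the union branch again.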

(b) is required to preserve type information reliably in JSON, but type
information for union members can *always* be lost in a round trip through
Python generics. If something is encoded as a 'long' when the schema reads
['int', 'long'], the Python code does not guarantee that an avro>python>avro
round trip will still be encoded as 'long'.
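
To illustrate the int/long case, here is a small sketch. The `resolve_branch` helper and its first-match rule are assumptions made for this example, not a description of what the Python avro library actually does internally; the point is only that a plain Python int carries no record of which branch it came from.

```python
# Hypothetical mapping from Avro type names to the Python type a decoded
# value would have. 'int' and 'long' both land on the same Python type.
PYTHON_TYPES = {
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "string": str,
}


def resolve_branch(union, datum):
    """Pick the first union branch whose Python type matches the datum."""
    for branch in union:
        if isinstance(datum, PYTHON_TYPES[branch]):
            return branch
    raise ValueError("datum matches no branch of %r" % (union,))


union = ["int", "long"]
decoded = 5  # written as 'long', but decoded into an ordinary Python int

rewritten_as = resolve_branch(union, decoded)
print(rewritten_as)  # int
```

Under this first-match assumption the value comes back out as 'int' even though it went in as 'long'; reversing the union order would flip the result, which shows the branch is chosen by the schema order and not by anything stored in the value.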

--Jeremy
On Apr 7, 2013 3:47 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:

> Thanks, that is very helpful. It actually makes complete sense (note the
> other email where I was wondering exactly how avro dealt with unions of
> similar types), I guess what threw me off is that the python implementation
> worked fine.
>
> Thanks again
> Jon
>
>
> 2013/4/7 Scott Carey <[EMAIL PROTECTED]>
>
>> It is well documented in the specification:
>> http://avro.apache.org/docs/current/spec.html#json_encoding
>>
>> I know others have overridden this behavior by extending GenericData
>> and/or the JsonDecoder/Encoder.  It wouldn't conform to the Avro
>> Specification JSON, but you can extend avro to do what you need it to.
>>
>> The reason for this encoding is to make sure that round-tripping data
>> from binary to json and back results in the same data.  Additionally,
>> unions can be more complicated and contain multiple records each with
>> different names.  Disambiguating the value requires more information since
>> several Avro data types map to the same JSON data type.  If the schema is a
>> union of bytes and string, is "hello" a string, or byte literal?  If it is
>> a union of a map and a record, is {"state":"CA", "city":"Pittsburgh"}  a
>> record with two string fields, or a map?   There are other approaches, and
>> for some users perfect transmission of types is not critical.  Generally
>> speaking, if you want to output Avro data as JSON and consume as JSON, the
>> extra data is not helpful.  If you want to read it back in as Avro, you're
>> going to need the info to know which branch of the union to take.
>>
>> On 4/6/13 6:49 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>>
>> Err, it's the output format that deserializes the json and then writes it
>> in the binary format, not the input format. But either way the general flow
>> is the same.
>>
>> As a general aside, is it the case that the java case is correct in that
>> when writing a union it should be {"string": "hello"} or whatnot? Seems
>> like we should probably add that to the documentation if it is a
>> requirement.
>>
>>
>> 2013/4/7 Jonathan Coveney <[EMAIL PROTECTED]>
>>
>>> Scott,
>>>
>>> Thanks for the input. The use case is that a number of our batch
>>> processes are built on python streaming. Currently, the reducer will output
>>> a json string as a value, and then the input format will deserialize the
>>> json, and then write it in binary format.
>>>
>>> Given that our use of python streaming isn't going away, any suggestions
>>> on how to make this better? Is there a better way to go from json string ->