Avro >> mail # user >> Issue writing union in avro?


Jonathan Coveney 2013-04-05, 09:24
Jonathan Coveney 2013-04-05, 11:15
Jonathan Coveney 2013-04-05, 11:59
Scott Carey 2013-04-06, 20:36
Jonathan Coveney 2013-04-07, 01:42
Jonathan Coveney 2013-04-07, 01:49
Scott Carey 2013-04-07, 07:16

Re: Issue writing union in avro?
Thanks, that is very helpful. It actually makes complete sense (note the
other email where I was wondering exactly how Avro dealt with unions of
similar types). I guess what threw me off is that the Python implementation
worked fine.

Thanks again
Jon
2013/4/7 Scott Carey <[EMAIL PROTECTED]>

> It is well documented in the specification:
> http://avro.apache.org/docs/current/spec.html#json_encoding
>
> I know others have overridden this behavior by extending GenericData
> and/or the JsonDecoder/Encoder.  It wouldn't conform to the Avro
> Specification JSON, but you can extend Avro to do what you need it to.
>
> The reason for this encoding is to make sure that round-tripping data from
> binary to JSON and back results in the same data.  Additionally, unions can
> be more complicated and contain multiple records, each with a different name.
> Disambiguating the value requires more information, since several Avro data
> types map to the same JSON data type.  If the schema is a union of bytes
> and string, is "hello" a string or a bytes literal?  If it is a union of a
> map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with two
> string fields, or a map?  There are other approaches, and for some users
> perfect transmission of types is not critical.  Generally speaking, if you
> want to output Avro data as JSON and consume it as JSON, the extra data is not
> helpful.  If you want to read it back in as Avro, you're going to need the
> info to know which branch of the union to take.
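
The ambiguity Scott describes can be sketched in plain Java without the Avro library. This is only an illustration under stated assumptions: `UnionBranches`, `candidates` and `label` are hypothetical names, not Avro APIs, and the type mapping is deliberately simplified (for example, a real record branch is labeled with the record's own name rather than the word "record"). It shows which branches a bare JSON value could match, and the labeled form that removes the ambiguity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Avro library. */
final class UnionBranches {

    /** Returns every union branch that a bare (unlabeled) JSON value could match. */
    static List<String> candidates(List<String> unionBranches, Object jsonValue) {
        List<String> matches = new ArrayList<>();
        for (String branch : unionBranches) {
            if (couldMatch(branch, jsonValue)) {
                matches.add(branch);
            }
        }
        return matches;
    }

    // Simplified view of how Avro types map onto JSON value kinds.
    // Several Avro types share one JSON kind, which is the source of the ambiguity.
    private static boolean couldMatch(String branch, Object v) {
        switch (branch) {
            case "null":    return v == null;
            case "boolean": return v instanceof Boolean;
            case "int":     return v instanceof Integer;
            case "long":    return v instanceof Integer || v instanceof Long;
            case "float":
            case "double":  return v instanceof Number;
            case "string":
            case "bytes":   return v instanceof String;
            case "map":
            case "record":  return v instanceof Map;   // simplification: see lead-in
            case "array":   return v instanceof List;
            default:        return false;
        }
    }

    /** Wraps an already-serialized JSON value in a one-entry object naming its branch. */
    static String label(String branch, String jsonValue) {
        // A null branch is written as a bare JSON null; every other branch is labeled.
        return "null".equals(branch) ? "null" : "{\"" + branch + "\": " + jsonValue + "}";
    }

    public static void main(String[] args) {
        // Both branches match, so the bare value cannot be resolved.
        System.out.println(candidates(List.of("bytes", "string"), "hello"));
        System.out.println(candidates(List.of("map", "record"), Map.of("state", "CA")));
        // The labeled form names the branch explicitly.
        System.out.println(label("string", "\"blue\""));
    }
}
```

When `candidates` returns more than one branch, a reader has no way to pick one from the bare value alone, which is why the labeled form is needed to read the data back in as Avro.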
>
> On 4/6/13 6:49 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>
> Err, it's the output format that deserializes the json and then writes it
> in the binary format, not the input format. But either way the general flow
> is the same.
>
> As a general aside, is the Java implementation correct in that, when
> writing a union, the value should be {"string": "hello"} or whatnot? Seems
> like we should probably add that to the documentation if it is a
> requirement.
>
>
> 2013/4/7 Jonathan Coveney <[EMAIL PROTECTED]>
>
>> Scott,
>>
>> Thanks for the input. The use case is that a number of our batch
>> processes are built on Python streaming. Currently, the reducer outputs
>> a JSON string as a value, and then the input format deserializes the
>> JSON and writes it in binary format.
>>
>> Given that our use of Python streaming isn't going away, any suggestions
>> on how to make this better? Is there a better way to go from JSON string ->
>> writing binary Avro data?
>>
>> Thanks again
>> Jon
>>
>>
>> 2013/4/6 Scott Carey <[EMAIL PROTECTED]>
>>
>>> This is due to using the JSON encoding for Avro and not the binary
>>> encoding.  It would appear that the Python version is a little bit lax on
>>> the spec.  Some have built variations of the JSON encoding that do not
>>> label the union, but there are drawbacks to this too, as the type can be
>>> ambiguous in a very large number of cases without a label.
>>>
>>> Why are you using the JSON encoding for Avro?  The primary purpose of
>>> the JSON serialization form as it is now is for transforming the binary to
>>> human readable form.
>>> Instead of building your GenericRecord from a JSON string, try using
>>> GenericRecordBuilder.
>>>
>>> -Scott
>>>
>>> On 4/5/13 4:59 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>>>
>>> Ok, I figured out the issue:
>>>
>>> If you make string c the following:
>>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, "
>>>     + "\"favorite_color\": {\"string\": \"blue\"}}";
>>>
>>> Then this works.
>>>
>>> This represents a divergence between the python and the Java
>>> implementation... the above does not work in Python, but it does work in
>>> Java. And of course, vice versa.
>>>
>>> I think I know how to fix this (and can file a bug with my reproduction
>>> and the fix), but I'm not sure which one is the expected case? Which
>>> implementation is wrong?
>>>
>>> Thanks
>>>
>>>
>>> 2013/4/5 Jonathan Coveney <[EMAIL PROTECTED]>
Jeremy Kahn 2013-04-09, 16:16