Avro, mail # user - Issue writing union in avro?

Re: Issue writing union in avro?
Jonathan Coveney 2013-04-07, 01:49
Err, it's the output format that deserializes the JSON and then writes it
in the binary format, not the input format. But either way, the general flow
is the same.

As a general aside, is the Java behavior the correct one, i.e. that when
writing a union the value must be labeled with its type, like
{"string": "hello"}? If that's a requirement, we should probably add it to
the documentation.
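(For reference, the Avro specification's JSON encoding does label unions: a null union value is encoded as JSON null, and any other value is wrapped in a single-key object whose key names the branch. Using the record from the gist, that looks like:

```json
{"name": "Alyssa", "favorite_number": {"int": 256}, "favorite_color": null}
```

An unlabeled form such as `"favorite_number": 256` is not the spec's JSON encoding, which matches the Java behavior seen in this thread.)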
2013/4/7 Jonathan Coveney <[EMAIL PROTECTED]>

> Scott,
>
> Thanks for the input. The use case is that a number of our batch processes
> are built on python streaming. Currently, the reducer will output a json
> string as a value, and then the input format will deserialize the json, and
> then write it in binary format.
>
> Given that our use of python streaming isn't going away, any suggestions
> on how to make this better? Is there a better way to go from json string ->
> writing binary avro data?
>
> Thanks again
> Jon
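(A minimal sketch of the JSON-string-to-binary flow described above, assuming the labeled-union JSON form; the schema and class names here are illustrative, not from the actual job:)

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class JsonToBinary {

    // Decode a JSON-encoded record and re-encode it in Avro binary form.
    public static byte[] jsonToBinary(Schema schema, String json) throws Exception {
        // JsonDecoder expects unions to be labeled, e.g. {"int": 256}.
        Decoder jsonDecoder = DecoderFactory.get().jsonDecoder(schema, json);
        GenericRecord record =
            new GenericDatumReader<GenericRecord>(schema).read(null, jsonDecoder);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]}]}");
        byte[] bytes = jsonToBinary(schema,
            "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}}");
        System.out.println(bytes.length);
    }
}
```

The same schema drives both the JsonDecoder and the BinaryEncoder, so no intermediate representation other than the GenericRecord is needed.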
>
>
> 2013/4/6 Scott Carey <[EMAIL PROTECTED]>
>
>> This is due to using the JSON encoding for avro and not the binary
>> encoding.  It would appear that the Python version is a little bit lax on
>> the spec.  Some have built variations of the JSON encoding that do not
>> label the union, but there are drawbacks to this too, as the type can be
>> ambiguous in a very large number of cases without a label.
>>
>> Why are you using the JSON encoding for Avro?  The primary purpose of the
>> JSON serialization form as it is now is for transforming the binary to
>> human readable form.
>> Instead of building your GenericRecord from a JSON string, try using
>> GenericRecordBuilder.
>>
>> -Scott
>>
>> On 4/5/13 4:59 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>>
>> Ok, I figured out the issue:
>>
>> If you make string c the following:
>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256},
>> \"favorite_color\": {\"string\": \"blue\"}}";
>>
>> Then this works.
>>
>> This represents a divergence between the Python and the Java
>> implementations: the labeled form above works in Java but fails in Python,
>> while the unlabeled form works in Python but fails in Java.
>>
>> I think I know how to fix this (and can file a bug with my reproduction
>> and the fix), but I'm not sure which one is the expected case? Which
>> implementation is wrong?
>>
>> Thanks
>>
>>
>> 2013/4/5 Jonathan Coveney <[EMAIL PROTECTED]>
>>
>>> Correction: the issue is when reading the string according to the avro
>>> schema, not on writing. it fails before I get a chance to write :)
>>>
>>>
>>> 2013/4/5 Jonathan Coveney <[EMAIL PROTECTED]>
>>>
>>>> I implemented essentially the Java avro example but using the
>>>> GenericDatumWriter and GenericDatumReader and hit an issue.
>>>>
>>>> https://gist.github.com/jcoveney/5317904
>>>>
>>>> This is the error:
>>>> Exception in thread "main" java.lang.RuntimeException:
>>>> org.apache.avro.AvroTypeException: Expected start-union. Got
>>>> VALUE_NUMBER_INT
>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
>>>> Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got
>>>> VALUE_NUMBER_INT
>>>>     at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
>>>>     at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
>>>>     at
>>>> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>>>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>>>     at
>>>> org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>>>     at
>>>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
>>>>     at
>>>> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>>>     at
>>>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>>>     at
>>>> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
>>>>
>>>> Am I doing something wrong? Is this a bug? I'm digging in now, but am
>>>> curious whether anyone has seen this before.