Avro >> mail # user >> Issue writing union in avro?


Re: Issue writing union in avro?
Scott,

Thanks for the input. The use case is that a number of our batch processes
are built on Python streaming. Currently, the reducer outputs a JSON
string as its value; the input format then deserializes that JSON and
writes it out in binary format.

Given that our use of Python streaming isn't going away, any suggestions on
how to make this better? Is there a better way to go from a JSON string to
binary Avro data?
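
For reference, a minimal sketch (JDK only, not Avro's own encoder) of what the
binary form of such a record looks like. Per the Avro spec, a union in the
binary encoding is written as the zero-based index of the chosen branch
followed by the value, so the branch is always explicit and no JSON-style
label is involved. The schema is an assumption: a record with a string `name`,
then `favorite_number` as ["int", "null"] and `favorite_color` as
["string", "null"]. The real schema in the gist may order the branches
differently. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; real code should use Avro's BinaryEncoder.
final class AvroUnionBinarySketch {

    // Avro int/long: zig-zag encode, then emit 7 bits per byte, low group first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Avro string: byte length as a long, then the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Record fields are written in schema order, with no field names.
    static byte[] encodeUser(String name, int favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, 0);              // assumed: branch 0 of ["int", "null"]
        writeLong(out, favoriteNumber);
        writeLong(out, 0);              // assumed: branch 0 of ["string", "null"]
        writeString(out, favoriteColor);
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeUser("Alyssa", 256, "blue")));
    }
}
```

Under those assumptions this prints 0c416c797373610080040008626c7565: the two
00 bytes are the union branch indexes, which play the role that the "int" and
"string" labels play in the JSON encoding.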

Thanks again
Jon
2013/4/6 Scott Carey <[EMAIL PROTECTED]>

> This is due to using the JSON encoding for Avro and not the binary
> encoding.  It would appear that the Python version is a little bit lax on
> the spec.  Some have built variations of the JSON encoding that do not
> label the union, but there are drawbacks to this too, as the type can be
> ambiguous in a very large number of cases without a label.
>
> Why are you using the JSON encoding for Avro?  The primary purpose of the
> JSON serialization form as it is now is for transforming the binary to
> human readable form.
> Instead of building your GenericRecord from a JSON string, try using
> GenericRecordBuilder.
>
> -Scott
>
> On 4/5/13 4:59 AM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:
>
> Ok, I figured out the issue:
>
> If you make string c the following:
> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256}, "
>     + "\"favorite_color\": {\"string\": \"blue\"}}";
>
> Then this works.
>
> This represents a divergence between the Python and the Java
> implementations: the labelled form above does not work in Python, but it
> does work in Java. And of course, vice versa for the unlabelled form.
>
> I think I know how to fix this (and can file a bug with my reproduction
> and the fix), but I'm not sure which behavior is the expected one. Which
> implementation is wrong?
>
> Thanks
>
>
> 2013/4/5 Jonathan Coveney <[EMAIL PROTECTED]>
>
>> Correction: the issue is when reading the string according to the Avro
>> schema, not when writing. It fails before I get a chance to write :)
>>
>>
>> 2013/4/5 Jonathan Coveney <[EMAIL PROTECTED]>
>>
>>> I implemented essentially the Java Avro example, but using the
>>> GenericDatumWriter and GenericDatumReader, and hit an issue.
>>>
>>> https://gist.github.com/jcoveney/5317904
>>>
>>> This is the error:
>>> Exception in thread "main" java.lang.RuntimeException: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
>>> Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>     at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
>>>     at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
>>>     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>>     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
>>>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>>     at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
>>>
>>> Am I doing something wrong? Is this a bug? I'm digging in now, but I'm
>>> curious whether anyone has seen this before.
>>>
>>> I get the feeling I am working with Avro in a way that most people do
>>> not :)
>>>
>>>
>>
>