Avro, mail # user - How to declare an optional field


Re: How to declare an optional field
Doug Cutting 2012-06-07, 17:03
It looks like you may be using GenericData#toString() for output and
then JsonDecoder for input.  Unfortunately, these two do not use
compatible JSON encodings of Avro data.

JsonEncoder/JsonDecoder are lossless (implementing the rules in
http://avro.apache.org/docs/current/spec.html#json_encoding) while
GenericData#toString() generates the JSON that most folks expect.  The
difference centers around unions.  Avro differentiates between int and
long, between string, bytes and enum, and between record and map, but
JSON does not; so when these are combined in a union, the lossless
encoding must tag the value in the JSON with the intended branch.  For
example, if you have a record X with a field "a" in the union
["X", {"type":"map", "values":"int"}] then Avro wouldn't know which
was meant when reading {"a":1}, so it must be encoded as
{"X": {"a":1}} or {"map": {"a":1}} in order to tell.
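As a rough illustration of the tagging rule above, here is a minimal
Python sketch.  Python is used only for brevity, and tag_union is a
hypothetical helper, not part of any Avro library.  It wraps a non-null
union value in a single-key object named after the intended branch and
leaves null untagged, as the JSON encoding section of the spec
describes.

```python
import json


def tag_union(value, branch):
    """Return the Avro-JSON form of a union value.

    null is written bare; any other value is wrapped as {branch: value},
    where branch is the type name ("long", "map", a record name, ...).
    """
    if value is None:
        return None
    return {branch: value}


datum = {"a": 1}

# The same plain JSON object, tagged as each of the two union branches.
as_record = json.dumps(tag_union(datum, "X"))
as_map = json.dumps(tag_union(datum, "map"))

print(as_record)  # {"X": {"a": 1}}
print(as_map)     # {"map": {"a": 1}}
```

Without the tag, both branches would serialize to the identical text
{"a": 1}, which is exactly the ambiguity a reader cannot resolve.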

Perhaps GenericData#toString() should use this encoding, but in many
cases folks want the simpler JSON when producing output that won't
be consumed by Avro.

If this is indeed what's causing you problems, the fix is to replace
your use of GenericData#toString() with a DatumWriter that uses a
JsonEncoder.
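If the JSON is instead produced by a script outside Java (an
assumption on my part, not something the thread confirms), the tagged
form could be emitted by hand.  The Python sketch below is
hypothetical: UNION_BRANCHES and to_avro_json are illustrative names,
and the two field names come from the schema fragment quoted below.  It
handles top-level fields only; unions nested inside array items or
sub-records would need the same treatment.

```python
import json

# Hypothetical mapping: union field name -> name of its non-null branch.
UNION_BRANCHES = {"in_reply_to": "long", "urls": "array"}


def to_avro_json(record, union_branches=UNION_BRANCHES):
    """Serialize a plain dict, tagging non-null values of union fields."""
    out = {}
    for name, value in record.items():
        branch = union_branches.get(name)
        if branch is not None and value is not None:
            # Non-null union value: wrap it as {branch: value}.
            out[name] = {branch: value}
        else:
            # Non-union field, or a null union value: written as-is.
            out[name] = value
    return json.dumps(out, sort_keys=True)


plain = {"in_reply_to": 206897508021055489, "urls": None, "lang": "fr"}
line = to_avro_json(plain)
print(line)
# {"in_reply_to": {"long": 206897508021055489}, "lang": "fr", "urls": null}
```

This is not a substitute for a DatumWriter with a JsonEncoder, which
applies the rule from the schema itself rather than from a hand-kept
table.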

Cheers,

Doug

On Thu, Jun 7, 2012 at 1:48 AM, François Kawala <[EMAIL PROTECTED]> wrote:
> Hello,
>
> First, thanks for your help. I've corrected my schema according to your
> advice, but I still have the same kind of issue:
>
> ________________________________
>
> With this schema:
>
> (...)
> {"name": "in_reply_to", "type": ["null", "long" ], "default": null },
> (...)
> {"name":"urls","type":["null",{"type":"array","items": (record) }]}
> (...)
>
> Using this schema, the following data:
>
> {"created_at": "Mon, 28 May 2012 00:01:25 +0000", "emitter": 405427230,
> "emitter_name": "CallmeOceane_", "geo": null, "hashtags": null,
> "in_reply_to": 206897508021055489,
> "lang": "fr", "msg": "@Chloe_OneD Aaaah puuuuutain j'ai toujours pas finis
> Wild Souls machin truc", "uid": 206897932501385217, "urls": null,
> "usermentions":
> [{"id": 288136906, "indices": [0, 11], "name": "Happiness \u10e6",
> "screen_name": "Chloe_OneD"}]}
>
> Ends with this error:
>
> 2012-06-07 10:16:07,831 WARN org.apache.hadoop.streaming.PipeMapRed: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
> at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
> at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
> at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
> at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
> at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
>
> ________________________________
>
> While using this data:
>
> {"created_at": "Mon, 28 May 2012 00:00:10 +0000", "emitter": 59809965,
> "emitter_name": "Droolius", "geo": null, "hashtags": null, "in_reply_to":
> null, "lang": "en", "msg":
> "RT @davidchang: Thank you again Amy Rowat & team UCLA @scienceandfood :
> Umami Reverse Engineering + The Joy of MSG http://t.co/nk1QBGbg", "uid":
> 206897616326377472,
> "urls": [{"display_url": "bit.ly/KvD0QZ", "expanded_url":
> "http://bit.ly/KvD0QZ", "indices": [119, 139], "url":
> "http://t.co/nk1QBGbg"}],
> "usermentions": [{"id": 221185711, "indices": [3, 14], "name": "Dave Chang",