Re: How to declare an optional field
Doug Cutting 2012-06-07, 17:03
It looks like you're perhaps using GenericData#toString() for output and then using JsonDecoder for input. Unfortunately, these do not encode Avro data in JSON compatibly. JsonEncoder/JsonDecoder are lossless (implementing the rules in http://avro.apache.org/docs/current/spec.html#json_encoding) while GenericData#toString() generates the JSON that most folks expect.

The difference centers around unions. Avro differentiates between int and long, string and bytes or enum, record and map; and so when these are combined in a union it must tag them in the JSON with the intended branch. For example, if you have a record X with a field "a" in the union ["X", {"type":"map", "values":"int"}] then Avro wouldn't know which was meant when reading {"a":1}, so it must encode this as {"X": {"a":1}} or {"map": {"a":1}} in order to tell. Perhaps GenericData#toString() should use this encoding, but in many cases folks want the simpler JSON when producing output that won't be consumed by Avro.

If this is indeed what's causing you problems, the fix is to replace your use of GenericData#toString() with a DatumWriter that uses a JsonEncoder.

Cheers,

Doug

On Thu, Jun 7, 2012 at 1:48 AM, François Kawala <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Firstly thanks for your help. I've corrected my schema according to your
> advice, but I've still the same kind of issue :
>
> ________________________________
>
> With this schema :
>
> (...)
> {"name": "in_reply_to", "type": ["null", "long" ], "default": null },
> (...)
> {"name":"urls","type":["null",{"type":"array","items": (record) }]}
> (...)
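For the two union fields in the schema excerpt above, the tagging rule Doug describes can be sketched in a few lines of Python. This is only an illustration of the JSON encoding rule, not the fix Doug recommends (a DatumWriter with a JsonEncoder on the Java side). It assumes the JSON is produced by a script before JsonDecoder reads it; the names tag_unions and UNION_BRANCH are hypothetical, and unions nested inside the elided "(record)" items are not handled.

```python
import json

# Union fields from the schema excerpt, mapped to the name of their
# non-null branch. Named types (record, enum, fixed) would be tagged
# with their own name instead of "record", "enum" or "fixed".
UNION_BRANCH = {"in_reply_to": "long", "urls": "array"}


def tag_unions(record, union_branch=UNION_BRANCH):
    """Return a copy of `record` with union values wrapped as {branch: value}."""
    tagged = dict(record)
    for field, branch in union_branch.items():
        value = tagged.get(field)
        # A null stays a bare JSON null; any other branch is wrapped in a
        # single-key object whose key names the branch.
        if value is not None:
            tagged[field] = {branch: value}
    return tagged


plain = {"in_reply_to": 206897508021055489, "urls": None, "lang": "fr"}
encoded = tag_unions(plain)
print(json.dumps(encoded))
# {"in_reply_to": {"long": 206897508021055489}, "urls": null, "lang": "fr"}
```

With the value wrapped as {"long": ...}, JsonDecoder finds the object it expects at the start of a union instead of a bare number.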
>
> Using this schema, the following data :
>
> {"created_at": "Mon, 28 May 2012 00:01:25 +0000", "emitter": 405427230,
> "emitter_name": "CallmeOceane_", "geo": null, "hashtags": null,
> "in_reply_to": 206897508021055489,
> "lang": "fr", "msg": "@Chloe_OneD Aaaah puuuuutain j'ai toujours pas finis
> Wild Souls machin truc", "uid": 206897932501385217, "urls": null,
> "usermentions":
> [{"id": 288136906, "indices": [0, 11], "name": "Happiness \u10e6",
> "screen_name": "Chloe_OneD"}]}|
>
> Ends on this error :
>
> 2012-06-07 10:16:07,831 WARN org.apache.hadoop.streaming.PipeMapRed:
> org.apache.avro.AvroTypeException: Expected start-union. Got
> VALUE_NUMBER_INT
> at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
> at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
> at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
> at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
> at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
>
> ________________________________
>
> While using this data :
>
> {"created_at": "Mon, 28 May 2012 00:00:10 +0000", "emitter": 59809965,
> "emitter_name": "Droolius", "geo": null, "hashtags": null, "in_reply_to":
> null,
> "lang": "en", "msg":
> "RT @davidchang: Thank you again Amy Rowat & team UCLA @scienceandfood :
> Umami Reverse Engineering + The Joy of MSG http://t.co/nk1QBGbg", "uid":
> 206897616326377472,
> "urls": [{"display_url": "bit.ly/KvD0QZ", "expanded_url":
> "http://bit.ly/KvD0QZ", "indices": [119, 139], "url":
> "http://t.co/nk1QBGbg"}],
> "usermentions": [{"id": 221185711, "indices": [3, 14], "name": "Dave Chang",