Avro, mail # user - Pig duplicate records


Re: Pig duplicate records
Scott Carey 2011-09-21, 20:55
You will want to ask this question on the Pig user mailing list.

org.apache.pig.piggybank.storage.avro.AvroStorage is maintained by the Pig
project and you will get more help from there.

On 9/21/11 4:34 AM, "Alex Holmes" <[EMAIL PROTECTED]> wrote:

>Hi all,
>
>I have a simple schema
>
>{"name": "Record", "type": "record",
>  "fields": [
>    {"name": "name", "type": "string"},
>    {"name": "id", "type": "int"}
>  ]
>}
>
>which I use to write 2 records to an Avro file, and my reader code
>(which reads the file and dumps the records) verifies that there are 2
>records in the file:
>
>Record@1e9e5c73[name=r1,id=1]
>Record@ed42d08[name=r2,id=2]
>
>When using this file with pig and AvroStorage, pig seems to think
>there are 4 records:
>
>grunt> REGISTER /app/hadoop/lib/avro-1.5.4.jar;
>grunt> REGISTER /app/pig-0.9.0/contrib/piggybank/java/piggybank.jar;
>grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/json-simple-1.1.jar;
>grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-core-asl-1.6.0.jar;
>grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-mapper-asl-1.6.0.jar;
>grunt> raw = LOAD 'test.v1.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage;
>grunt> dump raw;
>..
>Input(s):
>Successfully read 4 records (825 bytes) from:
>"hdfs://localhost:9000/user/aholmes/test.v1.avro"
>
>Output(s):
>Successfully stored 4 records (46 bytes) in:
>"hdfs://localhost:9000/tmp/temp2039109003/tmp1924774585"
>
>Counters:
>Total records written : 4
>Total bytes written : 46
>..
>(r1,1)
>(r2,2)
>(r1,1)
>(r2,2)
>
>I'm sure I'm doing something wrong (again)!
>
>Many thanks,
>Alex