search-hadoop.com (Sematext)
Pig >> mail # user >> AvroStorage and duplicate records


Re: AvroStorage and duplicate records
I am not an Avro expert, but one thing to check would be whether the Avro
version used to write the data is compatible with the version Pig uses to
read it.
-Thejas
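Thejas's suggestion can be made concrete. Below is a minimal sketch (the class name and approach are mine, not from the thread) that reports which Avro version is visible on a given classpath, assuming the jar's manifest records an implementation version; it degrades gracefully when Avro is absent or the manifest carries no version:

```java
public class AvroVersionCheck {

    // Report the Avro version visible on the classpath, if any.
    public static String avroVersion() {
        try {
            // Look up a core Avro class reflectively so this compiles
            // and runs even without avro-*.jar present.
            Class<?> schemaClass = Class.forName("org.apache.avro.Schema");
            Package pkg = schemaClass.getPackage();
            String v = (pkg == null) ? null : pkg.getImplementationVersion();
            return (v != null) ? v : "unknown (jar manifest has no version)";
        } catch (ClassNotFoundException e) {
            return "Avro not on classpath";
        }
    }

    public static void main(String[] args) {
        System.out.println("Avro: " + avroVersion());
    }
}
```

Running this once with the classpath used by the writer program and once with the classpath Pig uses (here, `/app/hadoop/lib/avro-1.5.4.jar`) would show whether the two sides agree.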
On 9/26/11 3:05 AM, Alex Holmes wrote:
> Hi,
>
> Anyone have any idea what could be up?  Is the Avro support in Pig not
> ready for prime time yet?
>
> Thanks,
> Alex
>
>
> On Wed, Sep 21, 2011 at 8:01 PM, Alex Holmes<[EMAIL PROTECTED]>  wrote:
>> Hi all,
>>
>> I have a simple schema
>>
>> {"name": "Record", "type": "record",
>>   "fields": [
>>    {"name": "name", "type": "string"},
>>    {"name": "id", "type": "int"}
>>   ]
>> }
>>
>> which I use to write 2 records to an Avro file with the following code:
>>
>>   public static Record createRecord(String name, int id) {
>>     Record record = new Record();
>>     record.name = name;
>>     record.id = id;
>>     return record;
>>   }
>>
>>   public static void writeToAvro(OutputStream outputStream)
>>       throws IOException {
>>     DataFileWriter<Record> writer =
>>         new DataFileWriter<Record>(new SpecificDatumWriter<Record>());
>>     writer.create(Record.SCHEMA$, outputStream);
>>
>>     writer.append(createRecord("r1", 1));
>>     writer.append(createRecord("r2", 2));
>>
>>     writer.close();
>>     outputStream.close();
>>   }
>>
>> I also have some reader code which reads in the file and just dumps
>> the contents of each Record:
>>
>>     DataFileStream<Record>  reader = new DataFileStream<Record>(
>>             is, new SpecificDatumReader<Record>(Record.SCHEMA$));
>>     for (Record a : reader) {
>>       System.out.println(ToStringBuilder.reflectionToString(a));
>>     }
>>
>> Its output is:
>>
>> Record@1e9e5c73[name=r1,id=1]
>> Record@ed42d08[name=r2,id=2]
>>
>> When using this file with pig and AvroStorage, pig seems to think
>> there are 4 records:
>>
>> grunt>  REGISTER /app/hadoop/lib/avro-1.5.4.jar;
>> grunt>  REGISTER /app/pig-0.9.0/contrib/piggybank/java/piggybank.jar;
>> grunt>  REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/json-simple-1.1.jar;
>> grunt>  REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-core-asl-1.6.0.jar;
>> grunt>  REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-mapper-asl-1.6.0.jar;
>> grunt>  raw = LOAD 'test.v1.avro' USING
>> org.apache.pig.piggybank.storage.avro.AvroStorage;
>> grunt>  dump raw;
>> ..
>> Input(s):
>> Successfully read 4 records (825 bytes) from:
>> "hdfs://localhost:9000/user/aholmes/test.v1.avro"
>>
>> Output(s):
>> Successfully stored 4 records (46 bytes) in:
>> "hdfs://localhost:9000/tmp/temp2039109003/tmp1924774585"
>>
>> Counters:
>> Total records written : 4
>> Total bytes written : 46
>> ..
>> (r1,1)
>> (r2,2)
>> (r1,1)
>> (r2,2)
>>
>> I'm sure I'm doing something wrong, but would appreciate any help.
>>
>> Many thanks,
>> Alex
>>
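One way to narrow this down (a sketch, not something from the thread) is to count what the file physically contains using a `GenericDatumReader`, which needs no generated classes. The counting logic is factored out so it works on any iterator; the Avro-specific part is shown in a comment because it requires avro-1.5.4.jar on the classpath:

```java
import java.util.Arrays;
import java.util.Iterator;

// Diagnostic sketch: count records in the file independently of the
// generated Record class, to see whether the duplication comes from
// the file itself or from AvroStorage.
public class AvroRecordCount {

    // Drain an iterator and return how many elements it produced.
    static long count(Iterator<?> it) {
        long n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    /*
     * With Avro on the classpath, the file can be opened like this:
     *
     *   InputStream is = new FileInputStream("test.v1.avro");
     *   DataFileStream<Object> reader = new DataFileStream<Object>(
     *       is, new GenericDatumReader<Object>());
     *   System.out.println("records in file: " + count(reader.iterator()));
     *
     * If this prints 2 while Pig's counters report 4, the duplication
     * happens on the AvroStorage side rather than in the file.
     */
    public static void main(String[] args) {
        // Demo with a plain list standing in for the Avro iterator.
        System.out.println(count(Arrays.asList("r1", "r2").iterator()));
    }
}
```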