Pig, mail # dev - Re: LOAD multiple files with glob


Re: LOAD multiple files with glob
Russell Jurney 2012-11-26, 18:23
Is the globbing feature making it into the AvroStorage rewrite?

Russell Jurney twitter.com/rjurney
On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:

> To answer myself again, I compiled Pig 0.11 and Piggybank, and it's working very well now, globbing seems to be fully supported!
>
> Bart Verwilst wrote on 26.11.2012 15:33:
>> To answer myself, could this be part of the solution? :
>>
>> https://issues.apache.org/jira/browse/PIG-2492
>>
>> Guess I'll have to wait for 0.11 then?
>>
>> Bart Verwilst wrote on 26.11.2012 14:19:
>>> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/*' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>>> Schema for avro unknown.
>>>
>>> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
>>>
>>> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>>> (id: long,value: chararray,pkey: chararray)}}
>>>
>>> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
>>> Found 1 items
>>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13 /test/2012-11-25.avro
>>>
>>> Cheolsoo Park wrote on 26.11.2012 10:45:
>>>> Hi,
>>>>
>>>>>> Invalid field projection. Projected field [tracetype] does not exist.
>>>>
>>>> The error indicates that "tracetype" doesn't exist in the Pig schema of
>>>> the relation "avro". What AvroStorage does is to automatically convert Avro
>>>> schema to Pig schema during the load. Although you have "tracetype" in your
>>>> Avro schema, "tracetype" doesn't exist in the generated Pig schema for
>>>> whatever reason.
>>>>
>>>> Can you please try to "describe avro"? You can replace group and dump
>>>> commands with describe in your Pig script. This will show you what the Pig
>>>> schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find
>>>> out why it doesn't. It could be because the schema of .avro files is not
>>>> the same or because there is a bug in AvroStorage, etc.
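>>>>
>>>> A minimal sketch of that check, assuming the same jars and the same input
>>>> path as the script quoted further down; only the last two statements
>>>> (group and dump) are swapped for a describe:
>>>>
>>>> ```pig
>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>
>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>
>>>> -- Same LOAD as in the original script (path taken from that script).
>>>> avro = LOAD '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>>>
>>>> -- Print the Pig schema derived from the Avro schema instead of running
>>>> -- GROUP and DUMP; "tracetype" should be listed here if it was converted.
>>>> DESCRIBE avro;
>>>> ```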
>>>>
>>>>>> Maybe globbing with [] doesn't work, but wildcard works?
>>>>
>>>> You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
>>>> path globbing doesn't support '[ ]'. But the above error (Projected field
>>>> [tracetype] does not exist) is not caused by this. A URISyntaxException is
>>>> what you would get because of '[ ]'.
>>>>
>>>> Thanks,
>>>> Cheolsoo
>>>>
>>>>
>>>>
>>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Just tried this:
>>>>>
>>>>>
>>>>> ----------------------------------------------------
>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>
>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>
>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>>>>
>>>>> groups = group avro by tracetype;
>>>>>
>>>>> dump groups;
>>>>> ----------------------------------------------------
>>>>>
>>>>> gave me:
>>>>>
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> Pig Stack Trace
>>>>> -------------