Pig, mail # user - LOAD multiple files with glob


Re: LOAD multiple files with glob
Bart Verwilst 2012-11-26, 15:50
To answer myself again, I compiled Pig 0.11 and Piggybank, and it's
working very well now, globbing seems to be fully supported!
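
One cause raised earlier in this thread (quoted below) for a field such as "tracetype" going missing is that the .avro files matched by a glob may not all share the same schema. The following is a minimal sketch, not part of the original exchange, of how that could be checked outside Pig. It assumes each file's Avro schema has already been extracted as JSON text (for example with the `getschema` command of avro-tools). The file names and schemas are hypothetical, and `field_names` and `missing_fields` are illustrative helpers, not part of Pig or Avro.

```python
import json


def field_names(schema_json):
    """Return the set of top-level field names of an Avro record schema."""
    schema = json.loads(schema_json)
    return {field["name"] for field in schema.get("fields", [])}


def missing_fields(schemas_by_file):
    """Map each file to the fields that other files have but it lacks."""
    names = {path: field_names(text) for path, text in schemas_by_file.items()}
    all_fields = set().union(*names.values()) if names else set()
    return {
        path: sorted(all_fields - present)
        for path, present in names.items()
        if all_fields - present
    }


# Hypothetical schemas for two files matched by one glob (assumption:
# these were dumped from the real files beforehand).
schemas = {
    "2012-01-01.avro": json.dumps({
        "type": "record",
        "name": "Trace",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "tracetype", "type": "int"},
        ],
    }),
    "2012-01-02.avro": json.dumps({
        "type": "record",
        "name": "Trace",
        "fields": [
            {"name": "id", "type": "long"},
        ],
    }),
}

report = missing_fields(schemas)
for path, missing in report.items():
    print(path, "is missing", missing)
```

An empty report would suggest the schemas agree on field names, which points back toward the loader itself rather than the data.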

Bart Verwilst wrote on 26.11.2012 15:33:
> To answer myself, could this be part of the solution? :
>
> https://issues.apache.org/jira/browse/PIG-2492
>
> Guess I'll have to wait for 0.11 then?
>
> Bart Verwilst wrote on 26.11.2012 14:19:
>> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>> REGISTER 'hdfs:///lib/piggybank.jar';
>>
>> DEFINE AvroStorage
>> org.apache.pig.piggybank.storage.avro.AvroStorage();
>> avro = load '/test/*' USING AvroStorage();
>> describe avro;
>>
>> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>> Schema for avro unknown.
>>
>> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
>>
>> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>> REGISTER 'hdfs:///lib/piggybank.jar';
>>
>> DEFINE AvroStorage
>> org.apache.pig.piggybank.storage.avro.AvroStorage();
>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
>> describe avro;
>>
>> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>> (id: long,value: chararray,pkey: chararray)}}
>>
>> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
>> Found 1 items
>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13
>> /test/2012-11-25.avro
>>
>> Cheolsoo Park wrote on 26.11.2012 10:45:
>>> Hi,
>>>
>>>>> Invalid field projection. Projected field [tracetype] does not
>>>>> exist.
>>>
>>> The error indicates that "tracetype" doesn't exist in the Pig schema
>>> of the relation "avro". AvroStorage automatically converts the Avro
>>> schema to a Pig schema during the load. Although you have "tracetype"
>>> in your Avro schema, "tracetype" doesn't exist in the generated Pig
>>> schema for whatever reason.
>>>
>>> Can you please try "describe avro"? You can replace the group and
>>> dump commands with describe in your Pig script. This will show you
>>> what the Pig schema of "avro" is. If "tracetype" indeed doesn't
>>> exist, you have to find out why. It could be because the schemas of
>>> the .avro files are not the same, or because there is a bug in
>>> AvroStorage, etc.
>>>
>>>>> Maybe globbing with [] doesn't work, but wildcard works?
>>>
>>> You're right. AvroStorage internally uses Hadoop path globbing, and
>>> Hadoop path globbing doesn't support '[ ]'. But the above error
>>> (Projected field [tracetype] does not exist) is not because of this.
>>> A URISyntaxException is what you will get because of '[ ]'.
>>>
>>> Thanks,
>>> Cheolsoo
>>>
>>>
>>>
>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Just tried this:
>>>>
>>>>
>>>> ----------------------------------------------------
>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>
>>>> DEFINE AvroStorage
>>>> org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>
>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING
>>>> AvroStorage();
>>>>
>>>> groups = group avro by tracetype;
>>>>
>>>> dump groups;
>>>> ----------------------------------------------------
>>>>
>>>> gave me:
>>>>
>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>> Projected field [tracetype] does not exist.
>>>>
>>>> Pig Stack Trace
>>>> ---------------
>>>> ERROR 1025:
>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>> Projected field [tracetype] does not exist.
>>>>
>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR