Pig, mail # user - LOAD multiple files with glob


Re: LOAD multiple files with glob
Bart Verwilst 2012-11-26, 14:33
To answer my own question: could this be part of the solution?

https://issues.apache.org/jira/browse/PIG-2492

Guess I'll have to wait for 0.11 then?
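
As a side check that does not involve Pig, the patterns themselves can be previewed against a list of file names. This is a minimal sketch using Python's fnmatch, which only approximates Hadoop path globbing and says nothing about how AvroStorage handles the pattern. The helper name preview_glob and every file name except 2012-11-25.avro are made up for illustration.

```python
from fnmatch import fnmatchcase


def preview_glob(pattern, names):
    """Return the names that a shell-style pattern selects, in input order."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Only 2012-11-25.avro appears in this thread; the other names are hypothetical.
names = ["2012-11-25.avro", "2012-01-01.avro", "2012-01-12.avro", "notes.txt"]

wildcard = preview_glob("2012-01-0*.avro", names)      # '*' wildcard
bracket = preview_glob("2012-01-0[1-3].avro", names)   # '[ ]' character range
everything = preview_glob("*", names)                  # same shape as '/test/*'

print("wildcard:  ", wildcard)
print("bracket:   ", bracket)
print("everything:", everything)
```

If the pattern selects the expected files here but the load still reports an unknown schema, the pattern is probably not the cause.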

Bart Verwilst wrote on 26.11.2012 14:19:
> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> REGISTER 'hdfs:///lib/piggybank.jar';
>
> DEFINE AvroStorage
> org.apache.pig.piggybank.storage.avro.AvroStorage();
> avro = load '/test/*' USING AvroStorage();
> describe avro;
>
> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
> Schema for avro unknown.
>
> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
>
> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> REGISTER 'hdfs:///lib/piggybank.jar';
>
> DEFINE AvroStorage
> org.apache.pig.piggybank.storage.avro.AvroStorage();
> avro = load '/test/2012-11-25.avro' USING AvroStorage();
> describe avro;
>
> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
> int,heading: int,terminalid: int,customerid: chararray,mileage:
> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
> (id: long,value: chararray,pkey: chararray)}}
>
> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
> Found 1 items
> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13
> /test/2012-11-25.avro
>
> Cheolsoo Park wrote on 26.11.2012 10:45:
>> Hi,
>>
>>>> Invalid field projection. Projected field [tracetype] does not
>>>> exist.
>>
>> The error indicates that the field "tracetype" doesn't exist in the Pig
>> schema of the relation "avro". AvroStorage automatically converts the
>> Avro schema to a Pig schema during the load. Although you have
>> "tracetype" in your Avro schema, it doesn't exist in the generated Pig
>> schema for whatever reason.
>>
>> Can you please try "describe avro"? You can replace the group and dump
>> commands with describe in your Pig script. This will show you what the
>> Pig schema of "avro" is. If "tracetype" indeed doesn't exist, you have
>> to find out why. It could be because the .avro files don't all share
>> the same schema, because there is a bug in AvroStorage, etc.
>>
>>>> Maybe globbing with [] doesn't work, but wildcard works?
>>
>> You're right. AvroStorage internally uses Hadoop path globbing, and
>> Hadoop path globbing doesn't support '[ ]'. But the above error
>> (Projected field [tracetype] does not exist) is not caused by this.
>> A URISyntaxException is what you will get because of '[ ]'.
>>
>> Thanks,
>> Cheolsoo
>>
>>
>>
>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Just tried this:
>>>
>>>
>>> ----------------------------------------------------
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage
>>> org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>
>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING
>>> AvroStorage();
>>>
>>> groups = group avro by tracetype;
>>>
>>> dump groups;
>>> ----------------------------------------------------
>>>
>>> gave me:
>>>
>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>> Projected field [tracetype] does not exist.
>>>
>>> Pig Stack Trace
>>> ---------------
>>> ERROR 1025:
>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>> Projected field [tracetype] does not exist.
>>>
>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066:
>>> Unable to open iterator for alias groups
>>>         at org.apache.pig.PigServer.openIterator(PigServer.java:862)
>>>         at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
>>>         at org.apache.pig.tools.pigscript.parser.