Pig >> mail # dev >> Re: LOAD multiple files with glob


Re: LOAD multiple files with glob
Is the globbing feature making it into the AvroStorage rewrite?

Russell Jurney twitter.com/rjurney
On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:

> To answer myself again, I compiled Pig 0.11 and Piggybank, and it's working very well now, globbing seems to be fully supported!
>
> Bart Verwilst wrote on 26.11.2012 15:33:
>> To answer myself, could this be part of the solution? :
>>
>> https://issues.apache.org/jira/browse/PIG-2492
>>
>> Guess I'll have to wait for 0.11 then?
>>
>> Bart Verwilst wrote on 26.11.2012 14:19:
>>> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/*' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>>> Schema for avro unknown.
>>>
>>> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
>>>
>>> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>
>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
>>> describe avro;
>>>
>>> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>>> (id: long,value: chararray,pkey: chararray)}}
>>>
>>> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
>>> Found 1 items
>>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13 /test/2012-11-25.avro
>>>
>>> Cheolsoo Park wrote on 26.11.2012 10:45:
>>>> Hi,
>>>>
>>>>>> Invalid field projection. Projected field [tracetype] does not exist.
>>>>
>>>> The error indicates that "tracetype" doesn't exist in the Pig schema of
>>>> the relation "avro". What AvroStorage does is to automatically convert Avro
>>>> schema to Pig schema during the load. Although you have "tracetype" in your
>>>> Avro schema, "tracetype" doesn't exist in the generated Pig schema for
>>>> whatever reason.
>>>>
>>>> Can you please try to "describe avro"? You can replace group and dump
>>>> commands with describe in your Pig script. This will show you what the Pig
>>>> schema of "avro" is. If "tracetype" indeed doesn't exist, you have to find
>>>> out why it doesn't. It could be because the schema of .avro files is not
>>>> the same or because there is a bug in AvroStorage, etc.
>>>>
>>>>>> Maybe globbing with [] doesn't work, but wildcard works?
>>>>
>>>> You're right. AvroStorage internally uses Hadoop path globbing, and Hadoop
>>>> path globbing doesn't support '[ ]'. But the above error (Projected field
>>>> [tracetype] does not exist) is not because of this. URISyntaxException is
>>>> what you will get because of '[ ]'.
>>>>
>>>> Thanks,
>>>> Cheolsoo
>>>>
>>>>
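
The URISyntaxException mentioned above can be reproduced outside Pig. The snippet below is only a sketch: it assumes the load path is handed to java.net.URI somewhere along the way, which the thread does not show. The class GlobUriCheck, the method isValidUri, and the bracketed sample path are made up for illustration and are not part of Pig or AvroStorage. It checks which path strings java.net.URI accepts.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of Pig or AvroStorage.
public class GlobUriCheck {

    // Returns true if java.net.URI can parse the string, false if it
    // throws URISyntaxException.
    public static boolean isValidUri(String path) {
        try {
            new URI(path);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] paths = {
            "/test/*",                                   // wildcard from the thread
            "/data/2012/trace_ejb3/2012-01-0*.avro",     // wildcard from the thread
            "/data/2012/trace_ejb3/2012-01-0[1-3].avro"  // hypothetical '[ ]' glob
        };
        for (String p : paths) {
            String result = isValidUri(p) ? "parses" : "URISyntaxException";
            System.out.println(p + " -> " + result);
        }
    }
}
```

'*' is a legal character in a URI path, while '[' and ']' are not, so only the bracketed path is rejected. This says nothing about the "Projected field [tracetype] does not exist" error, which is a separate schema problem.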
>>>>
>>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Just tried this:
>>>>>
>>>>>
>>>>> ----------------------------------------------------
>>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
>>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
>>>>> REGISTER 'hdfs:///lib/piggybank.jar';
>>>>>
>>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>>>>
>>>>> avro = load '/data/2012/trace_ejb3/2012-01-0*.avro' USING AvroStorage();
>>>>>
>>>>> groups = group avro by tracetype;
>>>>>
>>>>> dump groups;
>>>>> ----------------------------------------------------
>>>>>
>>>>> gave me:
>>>>>
>>>>> <file avro-test.pig, line 10, column 23> Invalid field projection.
>>>>> Projected field [tracetype] does not exist.
>>>>>
>>>>> Pig Stack Trace
>>>>> -------------