Pig, mail # dev - Re: LOAD multiple files with glob


Russell Jurney 2012-11-26, 18:23
Re: LOAD multiple files with glob
Cheolsoo Park 2012-11-26, 18:59
Yes, it is. Joe has unit test cases for path globbing in his patch:
https://reviews.apache.org/r/8104/diff/#index_header

On Mon, Nov 26, 2012 at 8:23 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Is the globbing feature making it into the AvroStorage rewrite?
>
> Russell Jurney twitter.com/rjurney
>
>
> On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>
> > To answer myself again, I compiled Pig 0.11 and Piggybank, and it's
> > working very well now; globbing seems to be fully supported!
> >
> > Bart Verwilst wrote on 26.11.2012 15:33:
> >> To answer myself, could this be part of the solution? :
> >>
> >> https://issues.apache.org/jira/browse/PIG-2492
> >>
> >> Guess I'll have to wait for 0.11 then?
> >>
> >> Bart Verwilst wrote on 26.11.2012 14:19:
> >>> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
> >>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>
> >>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>> avro = load '/test/*' USING AvroStorage();
> >>> describe avro;
> >>>
> >>> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
> >>> Schema for avro unknown.
> >>>
> >>> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
> >>>
> >>> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
> >>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>
> >>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
> >>> describe avro;
> >>>
> >>> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
> >>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
> >>> int,heading: int,terminalid: int,customerid: chararray,mileage:
> >>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
> >>> (id: long,value: chararray,pkey: chararray)}}
> >>>
> >>> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
> >>> Found 1 items
> >>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13 /test/2012-11-25.avro
> >>>
> >>> Cheolsoo Park wrote on 26.11.2012 10:45:
> >>>> Hi,
> >>>>
> >>>>>> Invalid field projection. Projected field [tracetype] does not exist.
> >>>>
> >>>> The error indicates that "tracetype" doesn't exist in the Pig schema
> >>>> of the relation "avro". What AvroStorage does is automatically convert
> >>>> the Avro schema to a Pig schema during the load. Although you have
> >>>> "tracetype" in your Avro schema, "tracetype" doesn't exist in the
> >>>> generated Pig schema for whatever reason.
> >>>>
> >>>> Can you please try to "describe avro"? You can replace the group and
> >>>> dump commands with describe in your Pig script. This will show you what
> >>>> the Pig schema of "avro" is. If "tracetype" indeed doesn't exist, you
> >>>> have to find out why it doesn't. It could be because the .avro files do
> >>>> not all share the same schema or because there is a bug in AvroStorage,
> >>>> etc.
> >>>>
> >>>>>> Maybe globbing with [] doesn't work, but wildcard works?
> >>>>
> >>>> You're right. AvroStorage internally uses Hadoop path globbing, and
> >>>> Hadoop path globbing doesn't support '[ ]'. But the above error
> >>>> (Projected field [tracetype] does not exist) is not because of this.
> >>>> URISyntaxException is what you will get because of '[ ]'.
> >>>>
> >>>> Thanks,
> >>>> Cheolsoo
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Just tried this:
> >>>>>
> >>>>>
> >>>>> ----------------------------------------------------
> >>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>>>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>>>
> >>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>>>
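
On the URISyntaxException mentioned in the thread for '[ ]' patterns: the sketch below only shows how the JDK's `java.net.URI` treats such strings. It assumes, without confirmation from the thread, that the load path is handed to `java.net.URI` unescaped somewhere along the way. The class name `GlobUriCheck`, the helper `parsesAsUri`, and the bracketed path are made up for illustration and are not part of AvroStorage or Hadoop.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class GlobUriCheck {

    /** Returns true if java.net.URI accepts the string without escaping. */
    static boolean parsesAsUri(String path) {
        try {
            new URI(path);
            return true;
        } catch (URISyntaxException e) {
            // '[' and ']' are not legal in a URI path, so bracket globs land here.
            return false;
        }
    }

    public static void main(String[] args) {
        // The first two paths come from the thread; the bracketed one is hypothetical.
        String[] paths = {
            "/test/*",
            "/test/2012-11-25.avro",
            "/test/2012-11-2[45].avro"
        };
        for (String p : paths) {
            System.out.println(p + " -> "
                + (parsesAsUri(p) ? "parses as a URI" : "URISyntaxException"));
        }
    }
}
```

This would be consistent with the observation that the wildcard load gets past path handling while a bracket pattern fails early. It says nothing about the separate "Schema for avro unknown" result, which the thread attributes to the AvroStorage version in use.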
Joseph Adler 2012-11-26, 18:37
Russell Jurney 2012-11-26, 20:36