Re: LOAD multiple files with glob
It's a total rewrite, so it hasn't exactly "made it in."

But yes, file globs should work correctly. That's one of the unit tests.
(All of the unit tests pass, incidentally.)
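
For reference, a minimal sketch of the kind of glob load discussed in this thread. It assumes Pig 0.11 with a rebuilt Piggybank, as Bart reports further down; the jar locations and the /test/ directory are taken from his quoted script and are not verified here.

```pig
-- Jar paths and input directory come from Bart's script quoted below;
-- adjust them for your own cluster.
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Wildcard glob: load every file directly under /test/.
avro = LOAD '/test/*' USING AvroStorage();

-- Print the Pig schema derived from the Avro schema before projecting
-- any fields, as Cheolsoo suggests further down the thread.
DESCRIBE avro;
```

If DESCRIBE prints "Schema for avro unknown." the glob was most likely not expanded by the AvroStorage version in use, which matches what Bart saw before rebuilding.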
On Mon, Nov 26, 2012 at 10:23 AM, Russell Jurney
<[EMAIL PROTECTED]> wrote:

> Is the globbing feature making it into the AvroStorage rewrite?
>
> Russell Jurney twitter.com/rjurney
>
>
> On Nov 26, 2012, at 7:50 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>
> > To answer myself again: I compiled Pig 0.11 and Piggybank, and it's
> > working very well now. Globbing seems to be fully supported!
> >
> > Bart Verwilst wrote on 26.11.2012 15:33:
> >> To answer myself, could this be part of the solution?
> >>
> >> https://issues.apache.org/jira/browse/PIG-2492
> >>
> >> Guess I'll have to wait for 0.11 then?
> >>
> >> Bart Verwilst wrote on 26.11.2012 14:19:
> >>> 14:16:08  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
> >>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>
> >>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>> avro = load '/test/*' USING AvroStorage();
> >>> describe avro;
> >>>
> >>> 14:16:09  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
> >>> Schema for avro unknown.
> >>>
> >>> 14:16:17  centos6-hadoop-hishiru  ~ $ vim avro-test.pig
> >>>
> >>> 14:16:25  centos6-hadoop-hishiru  ~ $ cat avro-test.pig
> >>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>
> >>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>> avro = load '/test/2012-11-25.avro' USING AvroStorage();
> >>> describe avro;
> >>>
> >>> 14:16:30  centos6-hadoop-hishiru  ~ $ pig avro-test.pig
> >>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
> >>> int,heading: int,terminalid: int,customerid: chararray,mileage:
> >>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
> >>> (id: long,value: chararray,pkey: chararray)}}
> >>>
> >>> 14:16:55  centos6-hadoop-hishiru  ~ $ hadoop fs -ls /test/
> >>> Found 1 items
> >>> -rw-r--r--   3 hdfs supergroup   63140500 2012-11-26 14:13 /test/2012-11-25.avro
> >>>
> >>> Cheolsoo Park wrote on 26.11.2012 10:45:
> >>>> Hi,
> >>>>
> >>>>>> Invalid field projection. Projected field [tracetype] does not exist.
> >>>>
> >>>> The error indicates that "tracetype" doesn't exist in the Pig schema of
> >>>> the relation "avro". AvroStorage automatically converts the Avro
> >>>> schema to a Pig schema during the load. Although you have "tracetype" in
> >>>> your Avro schema, it doesn't exist in the generated Pig schema for
> >>>> whatever reason.
> >>>>
> >>>> Can you please try "describe avro"? You can replace the group and dump
> >>>> commands with describe in your Pig script. This will show you what the
> >>>> Pig schema of "avro" is. If "tracetype" indeed doesn't exist, you have to
> >>>> find out why. It could be because the schemas of the .avro files are
> >>>> not the same, or because there is a bug in AvroStorage, etc.
> >>>>
> >>>>>> Maybe globbing with [] doesn't work, but wildcard works?
> >>>>
> >>>> You're right. AvroStorage internally uses Hadoop path globbing, and
> >>>> Hadoop path globbing doesn't support '[ ]'. But the above error
> >>>> (Projected field [tracetype] does not exist) is not caused by this.
> >>>> A URISyntaxException is what you will get because of '[ ]'.
> >>>>
> >>>> Thanks,
> >>>> Cheolsoo
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Nov 25, 2012 at 10:25 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Just tried this:
> >>>>>
> >>>>>
> >>>>> ----------------------------------------------------
> >>>>> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
> >>>>> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
> >>>>> REGISTER 'hdfs:///lib/piggybank.jar';
> >>>>>
> >>>>> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();