Pig >> mail # user >> LOAD multiple files with glob


Re: LOAD multiple files with glob
Hi,

I've tried loading a CSV with PigStorage() and got this:
txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',');
describe txt;

Schema for txt unknown.

Maybe this is because it's a CSV, so a schema is hard to figure out.

Any other suggestions? Our whole Hadoop setup is built around being
able to selectively load Avro files to run our jobs on, so if this
doesn't work we're pretty much screwed. :)

Thanks in advance!

Bart

Russell Jurney wrote on 24.11.2012 20:23:
> I suspect the problem is AvroStorage, not globbing. Try this with
> PigStorage.
>
> Russell Jurney twitter.com/rjurney
>
>
> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> Thanks for your suggestion!
>> I switched my avro variable to:
>> avro = load '$INPUT' USING AvroStorage();
>>
>> However I get the same results this way:
>>
>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
>> which: no hbase in
>> (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>> <snip>
>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>> int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM:
>> (id: long,value: chararray,pkey: chararray)}}
>>
>>
>> $ pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" avro-test.pig
>> which: no hbase in
>> (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>> <snip>
>> 2012-11-24 14:11:17,309 [main] ERROR
>> org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
>> error. null
>> Caused by: java.net.URISyntaxException: Illegal character in path at
>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>
>>
>> $ pig -p INPUT='/data/2012/trace_ejb3/2012-01-0[12].avro' avro-test.pig
>> which: no hbase in
>> (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>> <snip>
>> 2012-11-24 14:12:05,085 [main] ERROR
>> org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal
>> error. null
>> Details at logfile: /var/lib/hadoop-hdfs/pig_1353762722742.log
>> Caused by: java.net.URISyntaxException: Illegal character in path at
>> index 31: /data/2012/trace_ejb3/2012-01-0[12].avro
>>
>>
>> Deepak Tiwari wrote on 24.11.2012 00:41:
>>> Hi,
>>>
>>> I don't have a system to test it on right now, but I have been passing
>>> it with the -p parameter and it works.
>>>
>>> Change the line to accept a parameter, like:
>>> avro = load '$INPUT' USING AvroStorage();
>>>
>>> bin/pig -p INPUT="/data/2012/trace_ejb3/2012-01-0[12].avro" <scriptName>
>>>
>>> I think if you don't use double quotes, the expansion is done
>>> by the OS.
>>>
>>> Please let us know if it doesn't work...
>>>
>>>
>>>
>>> On Fri, Nov 23, 2012 at 12:45 PM, Bart Verwilst <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have the following files on HDFS:
>>>>
>>>> -rw-r--r--   3 hdfs supergroup   22989179 2012-11-22 11:17 /data/2012/trace_ejb3/2012-01-01.avro
>>>> -rw-r--r--   3 hdfs supergroup  240551819 2012-11-22 14:27 /data/2012/trace_ejb3/2012-01-02.avro
>>>> -rw-r--r--   3 hdfs supergroup  324464635 2012-11-22 18:28 /data/2012/trace_ejb3/2012-01-03.avro
>>>> -rw-r--r--   3 hdfs supergroup  345526418 2012-11-22 21:30 /data/2012/trace_ejb3/2012-01-04.avro
>>>> -rw-r--r--   3 hdfs supergroup  351322916 2012-11-23 00:28 /data/2012/trace_ejb3/2012-01-05.avro
>>>> -rw-r--r--   3 hdfs supergroup  325953043 2012-11-23 04:32 /data/2012/trace_ejb3/2012-01-06.avro
>>>> -rw-r--r--   3 hdfs supergroup  107019156 2012-11-23 05:58 /data/2012/trace_ejb3/2012-01-07.avro
>>>> -rw-r--r--   3 hdfs supergroup   46392850 2012-11-23 06:37 /data/2012/trace_ejb3/2012-01-08.avro
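
The thread raises the question of whether the shell expands the glob before Pig ever sees it. A minimal sketch for checking this locally, assuming a POSIX shell: the scratch directory and the two empty stand-in files are hypothetical, and nothing here touches HDFS or runs Pig. It only shows what string the shell hands to a command under each quoting style.

```shell
# Scratch directory with two empty stand-in files (hypothetical, local only).
tmp=$(mktemp -d)
touch "$tmp/2012-01-01.avro" "$tmp/2012-01-02.avro"
cd "$tmp" || exit 1

# Unquoted: the shell expands the bracket pattern before the command runs.
unquoted=$(echo 2012-01-0[12].avro)

# Double- or single-quoted: the pattern is passed through literally.
double=$(echo "2012-01-0[12].avro")
single=$(echo '2012-01-0[12].avro')

printf 'unquoted: %s\n' "$unquoted"
printf 'double:   %s\n' "$double"
printf 'single:   %s\n' "$single"

# Clean up the scratch directory.
cd / && rm -rf "$tmp"
```

Both quoted forms print the pattern unchanged, while the unquoted form prints the two matching file names. In the thread, both quoted invocations fail with the same URISyntaxException, and index 31 of the reported path is the `[` character, which suggests the pattern reached Pig intact and was rejected while the path was being parsed, not altered by the shell.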