
Pig, mail # user - LOAD multiple files with glob


Re: LOAD multiple files with glob
Bart Verwilst 2012-11-25, 20:14
Hello,

The schema is displayed by DESCRIBE when I run it like this:

--------------------------------------------
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avro = load '/data/2012/trace_ejb3/2012-01-02.avro' USING
AvroStorage();

describe avro;
---------------------------------------------
$ pig avro-test.pig
<snip>
avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}
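For comparison, the glob form that fails with "Schema for avro unknown" is the same script with only the path changed:

--------------------------------------------
avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();
describe avro;
---------------------------------------------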
The actual schema as used by Python to create those files is:

{
  "type": "record",
  "name": "trace",
  "namespace": "asp",
  "fields": [
    { "name": "id",              "type": "long" },
    { "name": "timestamp",       "type": "long" },
    { "name": "latitude",        "type": ["int", "null"] },
    { "name": "longitude",       "type": ["int", "null"] },
    { "name": "speed",           "type": ["int", "null"] },
    { "name": "heading",         "type": ["int", "null"] },
    { "name": "terminalid",      "type": "int" },
    { "name": "customerid",      "type": "string" },
    { "name": "mileage",         "type": ["int", "null"] },
    { "name": "creationtime",    "type": "long" },
    { "name": "tracetype",       "type": "int" },
    { "name": "traceproperties", "type": {
      "type": "array",
      "items": {
        "name": "traceproperty",
        "type": "record",
        "fields": [
          { "name": "id",    "type": "long" },
          { "name": "value", "type": "string" },
          { "name": "pkey",  "type": "string" }
        ]
      }
    }}
  ]
}
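One possible workaround, if the glob load alone keeps returning an unknown schema, is to pin the schema explicitly. This is only a sketch: it assumes your piggybank version supports AvroStorage's ('schema', '<json string>') constructor option, and the "..." stands for the full record schema above pasted in as one JSON string:

--------------------------------------------
-- Sketch: pass the schema explicitly so the glob load does not depend
-- on schema discovery across the matched files.
avro = load '/data/2012/trace_ejb3/2012-01-*.avro'
       USING AvroStorage('schema', '{"type": "record", "name": "trace", ...}');
describe avro;
---------------------------------------------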

Thanks!

Kind regards,

Bart

Cheolsoo Park wrote on 25.11.2012 15:33:
> Hi Bart,
>
> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING
> AvroStorage();
> gives me:
> Schema for avro unknown.
>
> This should work. The error that you're getting is not from
> AvroStorage but from PigServer:
>
> grep -r "Schema for .* unknown" *
> src/org/apache/pig/PigServer.java:
>  System.out.println("Schema for " + alias + " unknown.");
> ...
>
> It looks like you have an error in your Pig script. Can you please
> provide your Pig script and the schema of your avro files that
> reproduce the error?
>
> Thanks,
> Cheolsoo
>
>
> On Sun, Nov 25, 2012 at 1:02 AM, Bart Verwilst <[EMAIL PROTECTED]>
> wrote:
>
>> Hi,
>>
>> I've tried loading a csv with PigStorage(), getting this:
>>
>>
>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING
>> PigStorage(',');
>> describe txt;
>>
>> Schema for txt unknown.
>>
>> Maybe this is because it's a CSV, so a schema is hard to figure
>> out...
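>>
>> One way to give PigStorage a schema is an AS clause. A sketch, with
>> column names assumed to match the avro record's scalar fields (check
>> them against the actual CSV layout):
>>
>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000'
>>     USING PigStorage(',')
>>     AS (id:long, timestamp:long, latitude:int, longitude:int,
>>         speed:int, heading:int, terminalid:int, customerid:chararray,
>>         mileage:int, creationtime:long, tracetype:int);
>> describe txt;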
>>
>> Any other suggestions? Our whole hadoop setup is built around being
>> able to selectively load avro files to run our jobs on; if this
>> doesn't work, then we're pretty much screwed.. :)
>>
>> Thanks in advance!
>>
>> Bart
>>
>> Russell Jurney wrote on 24.11.2012 20:23:
>>
>>> I suspect the problem is AvroStorage, not globbing. Try this with
>>> PigStorage.
>>>
>>> Russell Jurney twitter.com/rjurney
>>>
>>>
>>> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for your suggestion!
>>>> I switched my avro variable to:
>>>>
>>>> avro = load '$INPUT' USING AvroStorage();
>>>>
>>>> However, I get the same result this way:
>>>>
>>>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>> <snip>
>>>> avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}