Pig user mailing list: LOAD multiple files with glob


Re: LOAD multiple files with glob
Hello,

The schema is displayed by describe when I run it like this:

--------------------------------------------
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avro = load '/data/2012/trace_ejb3/2012-01-02.avro' USING
AvroStorage();

describe avro;
---------------------------------------------
$ pig avro-test.pig
<snip>
avro: {id: long,timestamp: long,latitude: int,longitude: int,speed:
int,heading: int,terminalid: int,customerid: chararray,mileage:
int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id:
long,value: chararray,pkey: chararray)}}
21:08:46  centos6-hadoop-hishiru  ~ $
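
For comparison, the glob form under discussion further down the thread; assembled from the quoted replies below, it prints "Schema for avro unknown." instead of the record schema:

--------------------------------------------
REGISTER 'hdfs:///lib/avro-1.7.2.jar';
REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
REGISTER 'hdfs:///lib/piggybank.jar';

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Same load as above, but globbing over all of January 2012.
-- Per the replies below, describe currently reports "Schema for avro unknown."
avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING AvroStorage();

describe avro;
---------------------------------------------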
The actual schema as used by Python to create those files is:

{
  "type": "record",
  "name": "trace",
  "namespace": "asp",
  "fields": [
    { "name": "id",              "type": "long" },
    { "name": "timestamp",       "type": "long" },
    { "name": "latitude",        "type": ["int", "null"] },
    { "name": "longitude",       "type": ["int", "null"] },
    { "name": "speed",           "type": ["int", "null"] },
    { "name": "heading",         "type": ["int", "null"] },
    { "name": "terminalid",      "type": "int" },
    { "name": "customerid",      "type": "string" },
    { "name": "mileage",         "type": ["int", "null"] },
    { "name": "creationtime",    "type": "long" },
    { "name": "tracetype",       "type": "int" },
    { "name": "traceproperties", "type": {
        "type": "array",
        "items": {
          "name": "traceproperty",
          "type": "record",
          "fields": [
            { "name": "id",    "type": "long" },
            { "name": "value", "type": "string" },
            { "name": "pkey",  "type": "string" }
          ]
        }
      }
    }
  ]
}
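
As a quick sanity check on how that array comes through on the Pig side, here is a minimal sketch (hypothetical; "props" is a made-up alias, appended to the single-file script above) that flattens the traceproperties bag:

--------------------------------------------
-- Hypothetical follow-up to the script above: AvroStorage exposes the Avro
-- array of traceproperty records as a Pig bag of tuples, so FLATTEN turns
-- each array element into its own row next to the projected customerid.
props = FOREACH avro GENERATE customerid, FLATTEN(traceproperties);

describe props;
---------------------------------------------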

Thanks!

Kind regards,

Bart

Cheolsoo Park wrote on 25.11.2012 15:33:
> Hi Bart,
>
> avro = load '/data/2012/trace_ejb3/2012-01-*.avro' USING
> AvroStorage();
> gives me:
> Schema for avro unknown.
>
> This should work. The error that you're getting is not from
> AvroStorage but from PigServer.
>
> grep -r "Schema for .* unknown" *
> src/org/apache/pig/PigServer.java:
>  System.out.println("Schema for " + alias + " unknown.");
> ...
>
> It looks like you have an error in your Pig script. Can you please
> provide your Pig script and the schema of your Avro files that
> reproduce the error?
>
> Thanks,
> Cheolsoo
>
>
> On Sun, Nov 25, 2012 at 1:02 AM, Bart Verwilst <[EMAIL PROTECTED]>
> wrote:
>
>> Hi,
>>
>> I've tried loading a CSV with PigStorage(), getting this:
>>
>>
>> txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING
>> PigStorage(',');
>> describe txt;
>>
>> Schema for txt unknown.
>>
>> Maybe this is because it's a CSV, so the schema is hard to figure out..
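
PigStorage does not infer a schema from a CSV; describe only shows one when it is supplied with AS. A minimal sketch, with a column list guessed from the Avro schema above (hypothetical, adjust to the real file layout):

--------------------------------------------
-- Hypothetical: field names and types are copied from the Avro trace record
-- above; they must be adjusted to the actual column order of the CSV file.
txt = load '/import.mysql/trace_ejb3_2011/part-m-00000' USING PigStorage(',')
      AS (id:long, timestamp:long, latitude:int, longitude:int, speed:int,
          heading:int, terminalid:int, customerid:chararray, mileage:int,
          creationtime:long, tracetype:int);

describe txt;
---------------------------------------------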
>>
>> Any other suggestions? Our whole Hadoop setup is built around being able
>> to selectively load Avro files to run our jobs on; if this doesn't work,
>> then we're pretty much screwed.. :)
>>
>> Thanks in advance!
>>
>> Bart
>>
>> Russell Jurney wrote on 24.11.2012 20:23:
>>
>>  I suspect the problem is AvroStorage, not globbing. Try this with
>>> pigstorage.
>>>
>>> Russell Jurney twitter.com/rjurney
>>>
>>>
>>> On Nov 24, 2012, at 5:15 AM, Bart Verwilst <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for your suggestion!
>>>> I switched my avro variable to avro = load '$INPUT' USING
>>>> AvroStorage();
>>>>
>>>> However, I get the same results this way:
>>>>
>>>> $ pig -p INPUT=/data/2012/trace_ejb3/2012-01-02.avro avro-test.pig
>>>> which: no hbase in (:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.6.0_33/bin/:
>>>> /usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin)
>>>> <snip>
>>>> avro: {id: long,timestamp: long,latitude: int,longitude:
>>>> int,speed:
>>>> int,heading: int,terminalid: int,customerid: chararray,mileage:
>>>> int,creationtime: long,tracetype: int,traceproperties:
>>>> {ARRAY_ELEM: (id:
>>>> long,value: chararray,pkey: chararray)}}