

Re: Passing schema inside Load function
Hi Praveenesh,

Maybe this will get you started.

Suppose we have the desired schema parsed and stored in 'map' of type
LinkedHashMap<String, String>.  The key is your field name, and the
value denotes the data type, e.g., 'string', 'int', etc.
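
For what it's worth, here is a rough sketch of how 'map' might be filled
from a meta file.  The "name:type" line format, the class name
SchemaFileParser and both method names are my own assumptions for
illustration; nothing here is part of Pig, and in a real loader you would
more likely read the file from HDFS than from the local disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;

// Hypothetical helper (not part of Pig): reads "fieldName:type" lines
// into an insertion-ordered map, so field order matches the file.
final class SchemaFileParser {

    // Parses lines such as "age:int"; blank lines and '#' comments are skipped.
    static LinkedHashMap<String, String> parse(List<String> lines) {
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int sep = line.indexOf(':');
            if (sep <= 0) {
                // No separator, or an empty field name.
                throw new IllegalArgumentException("Expected name:type, got: " + raw);
            }
            String name = line.substring(0, sep).trim();
            String type = line.substring(sep + 1).trim();
            // An empty type is treated as 'string' (an arbitrary default).
            map.put(name, type.isEmpty() ? "string" : type);
        }
        return map;
    }

    // Convenience overload for a local file.
    static LinkedHashMap<String, String> parseFile(String path) throws IOException {
        return parse(Files.readAllLines(Paths.get(path)));
    }
}
```

A LinkedHashMap is used so that the fields come back in the same order as
they appear in the file, which matters once they become tuple positions.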

Now, let's derive the Pig schema from this map:

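import java.util.Map.Entry;
import org.apache.pig.impl.logicalLayer.schema.Schema;

Schema schema = new Schema();  // pig schema

// One FieldSchema per map entry, in the map's insertion order.
for (Entry<String, String> entry : map.entrySet()) {
    schema.add(new Schema.FieldSchema(entry.getKey(),
                                      getPigType(entry.getValue())));
}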

where getPigType returns the corresponding Pig data type:

// Needs: import org.apache.pig.data.DataType;
byte getPigType(String fieldType) {
    if (fieldType.equalsIgnoreCase("string")) {
        return DataType.CHARARRAY;
    } else if (fieldType.equalsIgnoreCase("int")) {
        return DataType.INTEGER;
    } else if (fieldType.equalsIgnoreCase("long")) {
        return DataType.LONG;
    } else if (fieldType.equalsIgnoreCase("float")) {
        return DataType.FLOAT;
    } else if (fieldType.equalsIgnoreCase("double")) {
        return DataType.DOUBLE;
    } else if (fieldType.equalsIgnoreCase("boolean")) {
        return DataType.BOOLEAN;
    } else {
        // Unknown type names fall back to chararray.
        return DataType.CHARARRAY;
    }
}

Now, you'll want to implement 'getSchema' in your custom loader:

@Override
public ResourceSchema getSchema(String location, Job job) throws IOException {
    // I'd actually cache this result if the schema is fixed
    return new ResourceSchema(schema);
}

This should take care of the schema, except you'd probably also need to
serialize it to the back-end so that you can enforce the schema inside
'getNext'.
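
As a rough illustration of that last step, the sketch below flattens the
same 'map' into a single string and converts a raw text field into the
matching Java object.  In Pig the string would typically be put into the
UDFContext properties on the front-end and read back on the back-end; that
wiring is left out so the example stays plain Java.  SchemaCodec, encode,
decode and coerce are names I made up, and returning null for a malformed
value is just one possible policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Pig).  Assumes field names and type
// names contain neither ',' nor ':'.
final class SchemaCodec {

    // Flattens the ordered map into "name:type,name:type".
    static String encode(Map<String, String> schema) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : schema.entrySet()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    // Inverse of encode; expects a string produced by encode.
    static LinkedHashMap<String, String> decode(String encoded) {
        LinkedHashMap<String, String> schema = new LinkedHashMap<>();
        if (encoded.isEmpty()) {
            return schema;
        }
        for (String pair : encoded.split(",")) {
            int sep = pair.indexOf(':');
            schema.put(pair.substring(0, sep), pair.substring(sep + 1));
        }
        return schema;
    }

    // Converts one raw text field to the Java type for its declared type
    // name.  Malformed numbers become null; unknown types stay Strings.
    static Object coerce(String raw, String fieldType) {
        if (raw == null) {
            return null;
        }
        String value = raw.trim();
        try {
            switch (fieldType.toLowerCase()) {
                case "int":     return Integer.valueOf(value);
                case "long":    return Long.valueOf(value);
                case "float":   return Float.valueOf(value);
                case "double":  return Double.valueOf(value);
                case "boolean": return Boolean.valueOf(value);
                default:        return raw;
            }
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Inside 'getNext' you would then decode the schema once, walk the fields of
each record in order, and call something like coerce on every raw value
before adding it to the tuple.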

stan

P.S. The above is essentially pseudo-code; I haven't actually type-checked it.

On Fri, Feb 3, 2012 at 5:45 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
> Thanks Stan,
> I was just going through these. I was wondering whether there is an easy
> way to do it, or whether I am reading something wrong.
> Now I will focus on what you have suggested, but I hope there is some easy
> solution to my problem.
>
> Praveenesh
>
> On Sat, Feb 4, 2012 at 4:12 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>
>> Hi Praveenesh,
>>
>> Assuming you have already read these:
>>
>> http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
>> http://pig.apache.org/docs/r0.9.2/udf.html#load-store-functions
>>
>> my next step would be to peruse the source code of some existing
>> loaders, e.g., PigStorage.
>>
>> Best,
>>
>> stan
>>
>>
>> On Fri, Feb 3, 2012 at 5:35 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
>> > Thanks Stan,
>> > If you were facing this kind of scenario, how would you have proceeded?
>> > Can you give me some pointers on how to write a custom loader, or some
>> > good tutorials on it?
>> > What is the current practice for solving the above scenario in Pig?
>> >
>> > Praveenesh
>> >
>> >
>> > On Sat, Feb 4, 2012 at 4:02 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>> >
>> >> My hunch is you'll have to write a custom loader, but I'll let the
>> >> experts chime in.  E.g., AvroStorage loader can parse the schema
>> >> from a json file passed to it via the constructor.  I don't think
>> >> PigStorage has the same option.
>> >>
>> >> stan
>> >>
>> >> On Fri, Feb 3, 2012 at 7:35 AM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
>> >> > Hey guys,
>> >> >
>> >> > I am new to Pig.
>> >> > I was wondering whether it is possible to pass a schema in the Pig
>> >> > LOAD statement when loading the data for the first time.
>> >> >
>> >> > Suppose I have a huge dataset containing around 100 columns. Is there
>> >> > a way to pass a schema defined in some other file (some kind of meta
>> >> > file) into the Pig LOAD statement, or do I have to define it every
>> >> > time inside the LOAD statement?
>> >> >
>> >> > Thanks,
>> >> > Praveenesh
>> >>
>>