Pig, mail # user - Passing schema inside Load function


Re: Passing schema inside Load function
Stan Rosenberg 2012-02-04, 02:40
Hi Praveenesh,

Maybe this will get you started.

Suppose we have the desired schema parsed and stored in 'map' of type
LinkedHashMap<String, String>.  The key is your field name, and the
value denotes the data type, e.g., 'string', 'int', etc.
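
If the schema lives in a separate meta file, building that map could look
roughly like the sketch below. The file layout (one 'name:type' pair per
line, '#' for comments) and the class name SchemaFile are my own
assumptions for illustration; Pig does not define either.

```java
import java.util.LinkedHashMap;
import java.util.List;

final class SchemaFile {
    // Parses lines such as "user_id:long" into an ordered field-name -> type map.
    // The "name:type" layout is a made-up convention for this sketch.
    static LinkedHashMap<String, String> parse(List<String> lines) {
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int sep = trimmed.indexOf(':');
            if (sep < 0) {
                throw new IllegalArgumentException("Expected name:type, got: " + line);
            }
            String name = trimmed.substring(0, sep).trim();
            String type = trimmed.substring(sep + 1).trim();
            map.put(name, type); // insertion order is preserved, so column order is kept
        }
        return map;
    }
}
```

A LinkedHashMap matters here because the column order in the file has to
match the order of the fields in the data.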

Now, let's derive the Pig schema from this map:

import java.util.Map.Entry;
import org.apache.pig.impl.logicalLayer.schema.Schema;

Schema schema = new Schema();  // pig schema

for (Entry<String, String> entry : map.entrySet()) {
    schema.add(new Schema.FieldSchema(entry.getKey(), getPigType(entry.getValue())));
}

where getPigType returns the corresponding Pig data type:

byte getPigType(String fieldType) {
    if (fieldType.equalsIgnoreCase("string")) {
        return DataType.CHARARRAY;
    } else if (fieldType.equalsIgnoreCase("int")) {
        return DataType.INTEGER;
    } else if (fieldType.equalsIgnoreCase("long")) {
        return DataType.LONG;
    } else if (fieldType.equalsIgnoreCase("float")) {
        return DataType.FLOAT;
    } else if (fieldType.equalsIgnoreCase("double")) {
        return DataType.DOUBLE;
    } else if (fieldType.equalsIgnoreCase("boolean")) {
        return DataType.BOOLEAN;
    } else {
        // unknown types fall back to chararray
        return DataType.CHARARRAY;
    }
}

Now, you'll want to implement 'getSchema' in your custom loader:

@Override
public ResourceSchema getSchema(String location, Job job) throws IOException {
    // I'd actually cache this result if the schema is fixed
    return new ResourceSchema(schema);
}

This should take care of the schema, except that you'd probably also need
to serialize it to the back-end so that you can enforce it inside
'getNext'.
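
To make that last step a bit more concrete, here is a rough, Pig-free
sketch of the two pieces: flattening the map into a single string (so it
can be stored in whatever properties your loader ships to the back-end)
and converting a raw text field according to its declared type. The class
SchemaCodec, its method names, and the 'name:type,name:type' encoding are
hypothetical, and the sketch assumes field names and types contain neither
',' nor ':'. How the string actually travels to the back-end is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaCodec {
    // Flattens the ordered map into "name:type,name:type" so it fits in one property value.
    static String encode(Map<String, String> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    // Rebuilds the map from a string produced by encode(); order is preserved.
    static LinkedHashMap<String, String> decode(String encoded) {
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        if (encoded.isEmpty()) {
            return map;
        }
        for (String pair : encoded.split(",")) {
            int sep = pair.indexOf(':');
            map.put(pair.substring(0, sep), pair.substring(sep + 1));
        }
        return map;
    }

    // Converts one raw text field to the Java type implied by the schema.
    // Returns null when the text does not parse as a number of that type;
    // unknown types are kept as strings, mirroring the default in getPigType.
    static Object cast(String raw, String fieldType) {
        try {
            if (fieldType.equalsIgnoreCase("int")) {
                return Integer.valueOf(raw);
            } else if (fieldType.equalsIgnoreCase("long")) {
                return Long.valueOf(raw);
            } else if (fieldType.equalsIgnoreCase("float")) {
                return Float.valueOf(raw);
            } else if (fieldType.equalsIgnoreCase("double")) {
                return Double.valueOf(raw);
            } else if (fieldType.equalsIgnoreCase("boolean")) {
                return Boolean.valueOf(raw);
            }
            return raw;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Inside 'getNext' you would then walk the decoded map in order and call
something like cast() on each field before adding it to the tuple.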

stan

P.S. The above is essentially pseudo-code; I haven't actually type-checked it.

On Fri, Feb 3, 2012 at 5:45 PM, praveenesh kumar <[EMAIL PROTECTED]> wrote:
> Thanks Stan,
> I was going through these only. I was wondering whether there is an easy
> way to do it, or whether I am reading something wrong.
> Now I will focus on what you have suggested, but I hope there is some easy
> solution to my problem.
>
> Praveenesh
>
> On Sat, Feb 4, 2012 at 4:12 AM, Stan Rosenberg <
> [EMAIL PROTECTED]> wrote:
>
>> Hi Praveenesh,
>>
>> Assuming you have already read these:
>>
>> http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
>> http://pig.apache.org/docs/r0.9.2/udf.html#load-store-functions
>>
>> my next step would be to peruse the source code of some existing
>> loaders, e.g., PigStorage.
>>
>> Best,
>>
>> stan
>>
>>
>> On Fri, Feb 3, 2012 at 5:35 PM, praveenesh kumar <[EMAIL PROTECTED]>
>> wrote:
>> > Thanks Stan,
>> > If you were facing this kind of scenario, how would you have proceeded?
>> > Can you give me some pointers on how to write a custom loader, or some
>> > good tutorials on it?
>> > What is the current practice for solving the above scenario in Pig?
>> >
>> > Praveenesh
>> >
>> >
>> > On Sat, Feb 4, 2012 at 4:02 AM, Stan Rosenberg <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> >> My hunch is you'll have to write a custom loader, but I'll let the
>> >> experts chime in.  E.g., AvroStorage loader can parse the schema
>> >> from a json file passed to it via the constructor.  I don't think
>> >> PigStorage has the same option.
>> >>
>> >> stan
>> >>
>> >> On Fri, Feb 3, 2012 at 7:35 AM, praveenesh kumar <[EMAIL PROTECTED]>
>> >> wrote:
>> >> > Hey guys,
>> >> >
>> >> > I am new to Pig.
>> >> > I was wondering whether it is possible to pass a schema in the Pig
>> >> > LOAD statement when loading the data for the first time.
>> >> >
>> >> > Suppose I have a huge dataset containing around 100 columns. Is there
>> >> > a way to pass a schema defined in some other file (some kind of meta
>> >> > file) into the Pig LOAD statement, or do I have to define it every
>> >> > time inside the LOAD statement?
>> >> >
>> >> > Thanks,
>> >> > Praveenesh
>> >>
>>