Pig >> mail # dev >> pig LoadMetaData find schema in AS clause from Loader.


Gerrit Jansen van Vuuren 2010-11-05, 15:13
Alan Gates 2010-11-10, 21:56
Gerrit Jansen van Vuuren 2010-11-11, 09:30
Re: pig LoadMetaData find schema in AS clause from Loader.
Yes, setPartitionFilter can be called only if Pig knows the partition columns. Without knowing the partition columns, the partition filter cannot be extracted.
If a user specifies a schema in the load statement, Pig finds the partition columns by looking up the positions of the columns returned by getPartitionKeys() in the user-defined schema, based on a mapping from the schema returned by getSchema() to the user-specified schema. I.e., Pig assumes that the columns returned by getPartitionKeys() are columns in the schema returned by getSchema().
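
The positional mapping described above can be sketched roughly as follows. This is only an illustration in plain Java, not Pig's actual code: the class `PartitionKeyMapping`, the method `toUserNames`, and the field names in `main` are all hypothetical, and schemas are simplified to lists of column names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: it only illustrates the positional mapping.
final class PartitionKeyMapping {

    /**
     * Translates partition key names from the loader's schema (getSchema())
     * into the names the user gave the same positions in the AS clause.
     */
    static List<String> toUserNames(List<String> loaderFields,
                                    List<String> userFields,
                                    List<String> partitionKeys) {
        List<String> result = new ArrayList<>();
        for (String key : partitionKeys) {
            // The key must be a column of the loader schema, as the thread assumes.
            int position = loaderFields.indexOf(key);
            if (position < 0 || position >= userFields.size()) {
                throw new IllegalArgumentException(
                    "Partition key has no matching column: " + key);
            }
            // Same position, but the name from the user-specified schema.
            result.add(userFields.get(position));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> loaderFields = Arrays.asList("a", "b", "daydate");
        List<String> userFields = Arrays.asList("x", "y", "dt");
        // prints [dt]
        System.out.println(
            toUserNames(loaderFields, userFields, Arrays.asList("daydate")));
    }
}
```

If the loader returns no schema at all, there is nothing to map from, which matches the behaviour reported in this thread.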

In your case, does getPartitionKeys() return columns that are specified in the user-defined schema?

Yes, please open a JIRA, and let's discuss it there. I think at least the javadoc might need to be updated.

-Thejas

On 11/11/10 1:30 AM, "Gerrit Jansen van Vuuren" <[EMAIL PROTECTED]> wrote:

Hi,

I guess it should only call setPartitionFilter when
LoadMetadata:getPartitionKeys returns a non-null value. Currently
getPartitionKeys is only called if the Loader returns a schema.
Should I create a JIRA and try proposing a fix for this?
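
A minimal sketch of the difference between the behaviour described in this thread and the proposed one, assuming the decision can be reduced to two inputs. The class and method names are hypothetical and this is not Pig's actual code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the decision discussed in this thread; not Pig's actual code.
final class PartitionFilterDecision {

    /** As described in the thread: partition keys are consulted only if the loader returns a schema. */
    static boolean currentBehaviour(boolean loaderReturnsSchema, List<String> partitionKeys) {
        return loaderReturnsSchema && partitionKeys != null;
    }

    /** As proposed: setPartitionFilter may be called whenever the partition keys are non-null. */
    static boolean proposedBehaviour(boolean loaderReturnsSchema, List<String> partitionKeys) {
        return partitionKeys != null;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("daydate");
        // Loader returns no schema; the user supplied one in the AS clause.
        System.out.println(currentBehaviour(false, keys));  // prints false
        System.out.println(proposedBehaviour(false, keys)); // prints true
    }
}
```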

Cheers,
 Gerrit
-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 10, 2010 9:56 PM
To: [EMAIL PROTECTED]
Subject: Re: pig LoadMetaData find schema in AS clause from Loader.

To answer your direct question, no, there is currently no provision in
the interface for Pig to provide the user defined schema to the load
function.

But it seems like the real solution to your problem is that
LoadMetadata:setPartitionFilter ought to be called regardless of
whether the loader returns a schema.  Is there a technical reason we
don't do that?

Alan.

On Nov 5, 2010, at 8:13 AM, Gerrit Jansen van Vuuren wrote:

> Hi,
>
> Is there any way in Pig for a LoadFunc to retrieve the schema
> definition entered by the user in the AS clause?
>
> e.g. A = LOAD '$INPUT' USING MyLoader() AS (a:int, b:int);
>
> My question comes from the problem below that I'm facing:
>
> I'm writing a Loader that adds partition fields to the schema, e.g.
> daydate, day, year, month, etc.
>
> These partitions are used to filter out entire folders in the storage
> location.
>
> I want to use the FILTER statement to filter by these keys.
>
> If I create a Loader that returns its own schema, the following works
> and the LoadMetadata:setPartitionFilter method gets called correctly by
> Pig.
>
> e.g.
>
> -- the loader will parse this and also add the partition folder daydate
> A = LOAD '$INPUT' USING MyLoader('a:int, b:int');
> F = FILTER A BY daydate == '2010-11-01';
> STORE F INTO '$OUTPUT';
>
> But if the Loader does not return a schema and the schema is defined by
> the user in the AS clause, Pig never calls
> LoadMetadata:setPartitionFilter at all, and the partition filtering
> never happens.
>
> e.g.
>
> A = LOAD '$INPUT' USING MyLoader() AS (a:int, b:int, daydate:chararray);
> F = FILTER A BY daydate == '2010-11-01';
> STORE F INTO '$OUTPUT';
>
> Any suggestions?
>
> Thanks,
> Gerrit
Gerrit Jansen van Vuuren 2010-11-11, 15:45