Re: Get ResourceSchema during putNext in StoreFunc
So I don't get null when I read the schema in the checkSchema method. I
set the class's internal schema variable, as in your gist, and it's not
null in the very next call to 'setStoreLocation'. However, it is null on
all later calls to 'setStoreLocation' and on every call to putNext.
I'm not entirely sure at what point it gets lost.

Now, if this were vanilla map-reduce I'd say that checkSchema is being
called once during the initial map-reduce job setup phase, and anything
you set in there is not going to be accessible to your later tasks, which
run on many different machines in the cluster.

You could set the schema with checkSchema and then on the FIRST call to
setStoreLocation you could place the schema in the job's configuration
as a string. What I'm not sure about is exactly how many times
setStoreLocation is actually called. I suspect (any Pig devs wanna help
me out here?) that it's called exactly once per task (i.e. during the
call to 'setup()' in vanilla map-reduce land). If that's true then all
you'd have to do is set it the first time then read it on all subsequent
calls to setStoreLocation. Could try it out at least...
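
Here's a rough sketch of that hand-off, just to make the idea concrete. It deliberately does NOT use the real Pig classes: SchemaHandoffSketch, SCHEMA_KEY and the List<String> of field names are all made-up stand-ins, a plain java.util.Properties plays the role of the job's configuration, and whether the real setStoreLocation is called in the order shown is exactly the thing I'm unsure about.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for a StoreFunc. None of these names come from Pig;
// only the method names mirror the ones discussed in this thread.
class SchemaHandoffSketch {
    // Made-up configuration key used to carry the schema as a string.
    static final String SCHEMA_KEY = "sketch.store.schema";

    // Null on every freshly constructed instance, like the field in the gist.
    private List<String> fieldNames;

    // Front end: remember the field names of the data being stored.
    void checkSchema(List<String> names) {
        fieldNames = names;
    }

    // If this instance already has the schema, write it into the configuration.
    // Otherwise (a fresh instance inside a task) read it back from there.
    void setStoreLocation(Properties jobConf) {
        if (fieldNames != null) {
            // Assumes field names contain no commas.
            jobConf.setProperty(SCHEMA_KEY, String.join(",", fieldNames));
        } else if (jobConf.getProperty(SCHEMA_KEY) != null) {
            fieldNames = Arrays.asList(jobConf.getProperty(SCHEMA_KEY).split(","));
        }
    }

    // Back end: pair each value with its field name.
    String putNext(Object[] tuple) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tuple.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(fieldNames.get(i)).append('=').append(tuple[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties(); // stands in for the job configuration

        // "Front end": checkSchema runs, then the first setStoreLocation call.
        SchemaHandoffSketch frontEnd = new SchemaHandoffSketch();
        frontEnd.checkSchema(Arrays.asList("id", "name"));
        frontEnd.setStoreLocation(jobConf);

        // "Task": a brand-new instance that never saw checkSchema.
        SchemaHandoffSketch task = new SchemaHandoffSketch();
        task.setStoreLocation(jobConf);
        System.out.println(task.putNext(new Object[] {1, "pig"})); // prints id=1,name=pig
    }
}
```

The point is only that the second instance never sees checkSchema and still gets the field names, because they travel as a string in the configuration rather than in an instance variable. In a real job the configuration is shipped out to the tasks rather than passed around as one in-memory object.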


On Tue, 2011-02-01 at 15:23 +0000, Dan Harvey wrote:
> This is the same problem I was getting. I've put a snippet of the code
> I was using here: https://gist.github.com/804551
> With this I get null whenever I try to read the ResourceSchema object
> in the checkSchema() method.
> I've had a look over the AvroStorage and it seems to assume the
> ResourceSchema won't be null at this point in time so I'm not sure
> what's going on for me.
> Does anyone know if this is the best way to get the schema, or if Pig
> will ever send a null schema to the checkSchema method?
> Thanks,
> On 1 February 2011 04:46, Jacob Perkins <[EMAIL PROTECTED]> wrote:
> >
> > Trying to write a simple storefunc that makes use of the input data's
> > field names. Is there a way to gain access to this inside of the call to
> > putNext? Ostensibly you could set a variable with the schema during the
> > call to checkSchema, e.g. as in HBaseStorage, but as far as I can tell this
> > is null by the time putNext is called. Is there some other way or am I
> > missing something obvious?
> >
> > --jacob
> > @thedatachef
> >
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015