Pig >> mail # user >> Question about properties for Loader

Jeff Yuan 2013-02-24, 23:33
Prashant Kommireddi 2013-02-25, 00:02
Jeff Yuan 2013-02-25, 01:07
Re: Question about properties for Loader
Hi Jeff,
It does not sound like you need properties (or a configuration). It sounds
like you want to pass arguments to your LoadFunc. You can create a LoadFunc
that takes an arbitrary number of String arguments. For example, the
default loader, PigStorage, takes 2 arguments: the first is a delimiter
(let's ignore the 2nd arg for now, it's advanced). So if you have a file
delimited by a colon rather than a tab, you can say this:

mystuff = load '/some/path' using PigStorage(':');

And this will cause the PigStorage(String delimiter) constructor to be
called. PigStorage will store the delimiter and use it to parse records.
The same constructor will be called on the client side (during parsing) and
on the server side (during mapper task initialization).
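
To make the constructor-argument idea concrete, here is a minimal sketch in
plain Java. It is not a real loader: DelimitedLineParser and its parse method
are hypothetical names made up for illustration, and an actual loader would
extend LoadFunc and implement its methods. It only shows the pattern of
keeping the String argument from the constructor and using it to split records.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a loader; a real one would extend org.apache.pig.LoadFunc.
class DelimitedLineParser {
    private final String delimiter;

    // Same shape as a String-argument constructor such as PigStorage(String delimiter).
    DelimitedLineParser(String delimiter) {
        this.delimiter = delimiter;
    }

    // Splits one record on the stored delimiter, keeping trailing empty fields.
    List<String> parse(String line) {
        // Pattern.quote treats the delimiter literally, so "|" or "." also work.
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }
}
```

Because the argument is captured in the constructor, the object behaves the
same way wherever it is constructed, as long as it receives the same argument.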

Now, if you want users to be able to change arguments without modifying the
script, you can parametrize the script. So instead you could say

mystuff = load '/some/path' using PigStorage('$DELIM');

and call your script with "pig --param DELIM=':' myscript.pig". That way
you can change the delimiter at invocation time.

If you do for some reason want to change job properties, you can use the -D
flag (e.g., '-Dpig.exec.mapPartAgg=true'). This will be available via the
Configuration *and* via Properties -- I don't want to get into the
differences because it's messy, but basically they are somewhat
interchangeable and you use whichever one is handy. If you are trying to
set a property from inside the code, you probably want to change it in the

The difference between -p and -D is that -p sets a parameter that is
substituted into the script, while -D sets a job property, which is more of
an environment setting.
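
As a rough sketch of reading such a property once you have a Properties
object in hand: the helper below is plain Java, and JobPropertyReader and
readBoolean are hypothetical names, not part of Pig. In a real loader the
Properties would come from Pig rather than being built by hand.

```java
import java.util.Properties;

// Hypothetical helper; the class and method names are illustrative only.
class JobPropertyReader {
    // Reads a boolean setting, falling back to a default when the key is absent.
    static boolean readBoolean(Properties props, String key, boolean defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }
}
```

For example, readBoolean(props, "pig.exec.mapPartAgg", false) falls back to
false when the property was never set.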

Hope this helps,

On Sun, Feb 24, 2013 at 5:07 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:

> Thanks for the pointers Prashant. I will take a look at PigStorage.
> I have a system for storing metadata, so users don't have to specify it.
> With respect to the properties, I guess my question is, are the ones
> passed in from the command line via -p stored in Property or
> Configuration from the UDFContext? What's the difference between
> Property and Configuration?
> Thanks.
> On Sun, Feb 24, 2013 at 4:02 PM, Prashant Kommireddi
> <[EMAIL PROTECTED]> wrote:
> > Hi Jeff,
> >
> > How do you see your loader being used? Would users specify schema file or
> > would that be something your loader sets without user being aware of it?
> > Can you pass it in as a constructor argument instead?
> >
> > UDFContext could be used, like you said to set/retrieve properties. You
> > might want to take a look at PigStorage that does something very similar
> > (look for the method applySchema(Tuple tup) )
> >
> > On Sun, Feb 24, 2013 at 3:33 PM, Jeff Yuan <[EMAIL PROTECTED]> wrote:
> >
> >> I'm trying to write a loader, extending LoadFunc, to read a specific
> >> file format.
> >>
> >> My question, how do I pass properties to it (for example the schema of
> >> the file type I'm loading)?  Would it be using the -p parameter from
> >> the cmdline when issuing the query?
> >>
> >> The second part of the question is, how would I access the passed in
> >> property/configuration from the code?  So far I'm theorizing it's
> >> something like this:
> >>         Properties p = udfc.getUDFProperties(this.getClass(), new String[]{ contextSignature });
> >>         Configuration conf = udfc.getJobConf();
> >> Then get it from p or conf?
> >>
> >> Thanks a lot for any pointers.
> >>
> >> -Jeff
> >>