Pig, mail # user - Restricting loading of log files based on parameter input

Stevens, Ian 2013-02-14, 22:16
Re: Restricting loading of log files based on parameter input
Cheolsoo Park 2013-02-15, 21:53
Hi Ian,

1) Pre-processor statements are just text substitution, so you can't call a
Python (or Java) function inside %declare.

2) Regarding DEFINE statements, there are two problems with using them for
scripting UDFs:
- You can't pass constructor parameters to a scripting UDF.
- You can't use a scripting UDF as a Load/StoreFunc.

Given these constraints, I think writing a Java LoadFunc is the best
option. I would write a sub-class of OmnitureTextLoader that takes
constructor parameters. For example,

public class MyOmnitureTextLoader extends OmnitureTextLoader {

  private String year;
  private String month;

  public MyOmnitureTextLoader() { ... }
  public MyOmnitureTextLoader(String year, String month) { ... }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // Compute the week path from year and month, substitute it into
    // location, then pass the result on to super.setLocation().
    ...
  }
}

Then, in Pig, you can do something like:

DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');

A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;
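
For the path computation itself, here is a minimal sketch. It assumes the
loader also receives a start day (the example above only takes year and
month). WeekPath and weekGlob are hypothetical names, not part of Pig or
OmnitureTextLoader, and the code uses java.time, so it needs Java 8 or later:

```java
import java.time.LocalDate;
import java.util.StringJoiner;

/** Hypothetical helper; not part of Pig or OmnitureTextLoader. */
class WeekPath {

  /** Builds a glob covering the seven days that start at the given date. */
  static String weekGlob(int year, int month, int day) {
    LocalDate start = LocalDate.of(year, month, day);
    // Produces "{a,b,c,...}" so Hadoop's glob matching expands each day.
    StringJoiner glob = new StringJoiner(",", "{", "}");
    for (int i = 0; i < 7; i++) {
      // plusDays rolls over month and year boundaries for us.
      LocalDate d = start.plusDays(i);
      glob.add(String.format("year=%04d/month=%02d/day=%02d",
          d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
    }
    return glob.toString();
  }
}
```

Inside setLocation you could then replace the placeholder in location with
WeekPath.weekGlob(...) before calling super.setLocation(). Since constructor
parameters arrive as strings, they would need Integer.parseInt first.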

Hope this is helpful.

On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian

> Hi everyone. I'm having a problem loading log files based on parameter
> input and was wondering whether someone would be able to provide some
> guidance. The logs in question are Omniture logs, stored in subdirectories
> based on year, month, and day (eg. /year=2013/month=02/day=14). For any
> day, multiple logs could exist, each hundreds of MB.
> I have a Pig script which currently processes logs for an entire month,
> with the month and the year specified as script parameters (eg.
> /year=$year/month=$month/day=*). It works fine and we're quite happy with
> it. That said, we want to switch to weekly processing of logs, which means
> the previous LOAD path glob won't work (weeks can wrap months as well as
> years). To solve this, I have a Python UDF which takes a start date and
> spits out the necessary glob for a week's worth of logs, eg:
> >>> log_path_regex(2013, 1, 28)
> '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
> This glob will then be inserted in the appropriate path:
> > %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> > data = LOAD '$omniture_log_path' USING OmnitureTextLoader();
> > -- See http://github.com/msukmanowsky/OmnitureTextLoader
> Unfortunately, I can't for the life of me figure out how to populate
> $week_path based on the $year, $month and $day script parameters. I tried
> using %declare, but grunt complains and says it's logging error messages,
> though it never does:
> > %declare week_path util.log_path_regex(year, month, day);
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig
> version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /tmp/pig_1360878842643.log
> % ls  /tmp/pig_1360878842643.log
> ls: cannot access /tmp/pig_1360878842643.log: No such file or directory
> The same error results if I prefix the parameters with dollar signs or
> surround prefixed parameters with quotes.
> If I try to use define (which I believe only works for static Java
> functions), I get the following:
> > define week_path util.log_path_regex(year, month, day);
> 2013-02-14 17:00:42,392 [main] ERROR
> org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11,
> column 37>  mismatched input 'year' expecting RIGHT_PAREN
> As with %declare, I get the same error if I prefix the parameters with
> dollar signs or surround prefixed parameters with quotes.
> I've searched around and haven't come up with a solution. I'm possibly
> searching for the wrong thing. Invoking a shell command may work, but would
> be difficult as it would complicate our script deploy and may not be
