I forgot to mention that the date stamp also exists in the filename of the log in addition to the path.
Is a custom LoadFunc the answer? With that, I'd presumably have to specify /year=*/month=*/day=* and force Pig to test every file name for a date stamp which falls between two dates. That seems like a huge hack and a waste of resources.
From: Stevens, Ian [mailto:[EMAIL PROTECTED]]
Sent: February-14-13 5:17 PM
To: '[EMAIL PROTECTED]'
Subject: Restricting loading of log files based on parameter input
Hi everyone. I'm having a problem loading log files based on parameter input and was wondering whether someone would be able to provide some guidance. The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (eg. /year=2013/month=02/day=14). For any day, multiple logs could exist, each hundreds of MB.
I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script parameters (eg. /year=$year/month=$month/day=*). It works fine and we're quite happy with it. That said, we want to switch to weekly processing of logs, which means the previous LOAD path glob won't work (weeks can wrap months as well as years). To solve this, I have a Python UDF which takes a start date and spits out the necessary glob for a week's worth of logs, eg:
>>> log_path_regex(2013, 1, 28)
This glob will then be inserted in the appropriate path:
> %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); // See http://github.com/msukmanowsky/OmnitureTextLoader
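For illustration, a week-glob function of the kind described above might look roughly like the following. This is a minimal sketch, not the actual UDF: only the name log_path_regex and its (year, month, day) arguments come from the message. The `days` parameter and the brace-alternation output format are assumptions, and it presumes the file system's glob support accepts {a,b,c} alternation containing slashes.

```python
from datetime import date, timedelta


def log_path_regex(year, month, day, days=7):
    """Return a glob covering `days` consecutive days starting at the given date.

    The output format (brace alternation over year=/month=/day= directories)
    is an assumption made for this sketch.
    """
    start = date(year, month, day)
    parts = []
    for offset in range(days):
        # date arithmetic handles month and year boundaries for us
        d = start + timedelta(days=offset)
        parts.append("year=%04d/month=%02d/day=%02d" % (d.year, d.month, d.day))
    return "{" + ",".join(parts) + "}"
```

Because the date arithmetic is delegated to datetime, a week starting on January 28 rolls over into February without any special-casing.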
Unfortunately, I can't for the life of me figure out how to populate $week_path based on the $year, $month and $day script parameters. I tried using %declare, but grunt complains and says it's logging the error, yet never does:
> %declare week_path util.log_path_regex(year, month, day);
2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log
% ls /tmp/pig_1360878842643.log
ls: cannot access /tmp/pig_1360878842643.log: No such file or directory
The same error results if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.
If I try to use define (which I believe only works for static Java functions), I get the following:
> define week_path util.log_path_regex(year, month, day);
2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37> mismatched input 'year' expecting RIGHT_PAREN
As with %declare, I get the same error if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.
I've searched around and haven't come up with a solution. I'm possibly searching for the wrong thing. Invoking a shell command may work, but would be difficult as it would complicate our script deploy and may not be feasible given we're retrieving logs from S3 and not a mounted directory.
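To make the shell-command option concrete, here is a rough sketch of a standalone script that prints a week's glob to stdout. The script name week_glob.py, the function name week_glob and the output format are all hypothetical. The %declare line in the comment relies on Pig's parameter substitution accepting a backtick-quoted command; I have not verified that it behaves as hoped in this setup, so treat it as an assumption.

```python
import sys
from datetime import date, timedelta

# Hypothetical usage from a Pig script (assumes backtick command
# substitution in %declare works in this environment):
#   %declare week_path `python week_glob.py $year $month $day`;


def week_glob(year, month, day):
    """Build a brace-alternation glob for the 7 days starting at the given date."""
    start = date(int(year), int(month), int(day))
    days = [start + timedelta(days=i) for i in range(7)]
    return "{" + ",".join(d.strftime("year=%Y/month=%m/day=%d") for d in days) + "}"


# Only print when invoked with exactly three arguments: year month day.
if __name__ == "__main__" and len(sys.argv) == 4:
    print(week_glob(*sys.argv[1:]))
```

The script only computes path strings and never touches the file system, so the logs living on S3 rather than a mounted directory would not matter for this step. It would still need to be deployed alongside the Pig script.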
It's also likely there's a nice Pig-friendly way to restrict LOAD other than using globs. That said, I'd still have to use my UDF which seems to be the root of the issue.
Do I need to convert my UDF to a static Java method? Or will I run into the same issue? (I hesitate to do this on the off-chance it will work. It's an 8-line Python function, readily deployable and much more maintainable by others than the equivalent Java code would be.)