Restricting loading of log files based on parameter input
Stevens, Ian 2013-02-14, 22:16
Hi everyone. I'm having a problem loading log files based on parameter input and was wondering whether someone would be able to provide some guidance.

The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (e.g. /year=2013/month=02/day=14). For any day, multiple logs could exist, each hundreds of MB.

I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script parameters (e.g. /year=$year/month=$month/day=*). It works fine and we're quite happy with it. That said, we want to switch to weekly processing of logs, which means the previous LOAD path glob won't work (weeks can wrap months as well as years). To solve this, I have a Python UDF which takes a start date and spits out the necessary glob for a week's worth of logs, e.g.:

    >>> log_path_regex(2013, 1, 28)
    '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'

This glob will then be inserted in the appropriate path:

    > %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
    > data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); -- See http://github.com/msukmanowsky/OmnitureTextLoader

Unfortunately, I can't for the life of me figure out how to populate $week_path based on the $year, $month and $day script parameters. I tried using %declare, but grunt complains and says it's logging the error, yet never does:

    > %declare week_path util.log_path_regex(year, month, day);
    2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
    2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log
    % ls /tmp/pig_1360878842643.log
    ls: cannot access /tmp/pig_1360878842643.log: No such file or directory

The same error results if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.

If I try to use define (which I believe only works for static Java functions), I get the following:

    > define week_path util.log_path_regex(year, month, day);
    2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37> mismatched input 'year' expecting RIGHT_PAREN

As with %declare, I get the same error if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.

I've searched around and haven't come up with a solution. I'm possibly searching for the wrong thing. Invoking a shell command may work, but would be difficult as it would complicate our script deploy and may not be feasible given we're retrieving logs from S3 and not a mounted directory. It's also likely there's a nice Pig-friendly way to restrict LOAD other than using globs. That said, I'd still have to use my UDF, which seems to be the root of the issue.

Do I need to convert my UDF to a static Java method? Or will I run into the same issue? (I hesitate to do this on the off-chance it will work. It's an 8-line Python function, readily deployable and much more maintainable by others than the equivalent Java code would be.)

Any ideas?

Cheers,
Ian.
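
For reference, a function that produces the glob shown in the message above could look roughly like the following. This is a reconstruction from the example output, not Ian's actual UDF; the `days` parameter is an addition made here for illustration.

```python
from datetime import date, timedelta


def log_path_regex(year, month, day, days=7):
    """Return a brace glob covering `days` consecutive daily directories.

    Reconstructed from the example output in the thread; `days` is a
    hypothetical extra parameter, not part of the original function.
    """
    start = date(year, month, day)
    parts = [
        (start + timedelta(days=offset)).strftime("year=%Y/month=%m/day=%d")
        for offset in range(days)
    ]
    return "{" + ",".join(parts) + "}"


if __name__ == "__main__":
    # A week starting on 2013-01-28 wraps from January into February.
    print(log_path_regex(2013, 1, 28))
```

Because the dates are advanced with `timedelta`, weeks that cross a month or year boundary need no special handling.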
Re: Restricting loading of log files based on parameter input
Cheolsoo Park 2013-02-15, 21:53
Hi Ian,

1) Pre-processor statements are just text substitution, so you can't call a Python (or Java) function inside %declare.

2) Regarding DEFINE statements, there are two problems with using them for scripting UDFs:
- You can't pass constructor parameters to a scripting UDF.
- You can't use a scripting UDF as a Load/StoreFunc.

Given these constraints, I think writing a Java LoadFunc is the best option. I would write a subclass of OmnitureTextLoader that takes constructor parameters. For example,

    class MyOmnitureTextLoader extends OmnitureTextLoader {

        private String year;
        private String month;

        public MyOmnitureTextLoader() { ... }
        public MyOmnitureTextLoader(String year, String month) { ... }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            // Compute week path with year and month and replace location with that.
        }
    }

Then, you can do something like this in Pig (constructor arguments are passed as quoted strings):

    DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');

    A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;

Hope this is helpful.

Thanks,
Cheolsoo
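
Because %declare only substitutes text, the week glob has to exist as a plain string before the script is parsed. One route, which Ian considered and set aside for deployment reasons, is to compute it outside Pig and hand it in with Pig's -param option. The sketch below assumes a hypothetical helper named `pig_command` and reuses the script name script.pig from the error output; it only builds the argument list and has not been run against Pig 0.10.1.

```python
def pig_command(script, params):
    """Build an argv list that passes pre-computed parameters to Pig.

    `pig_command` is a hypothetical helper for illustration; it relies only
    on Pig's `-param name=value` command-line option.
    """
    argv = ["pig"]
    for name, value in params.items():
        argv.extend(["-param", "{}={}".format(name, value)])
    argv.append(script)
    return argv


if __name__ == "__main__":
    # A shortened two-day glob, in the same shape as the one in the thread.
    week_path = "{year=2013/month=01/day=31,year=2013/month=02/day=01}"
    print(pig_command("script.pig", {"week_path": week_path}))
```

Passing the list to `subprocess.run(argv, check=True)` would launch Pig without going through a shell, so the braces in the glob are not subject to shell brace expansion.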
RE: Restricting loading of log files based on parameter input
Stevens, Ian 2013-02-19, 21:55
Thanks for this, Cheolsoo. I started work on a LoadFunc assuming it was an easy win, although I hate that I have to do this. It's just text concatenation after all. Moving the logic of our log path structure to a Java class or an external package is wrong from a maintenance standpoint.

How familiar are you (or anyone) with creating a custom LoadFunc? The documentation I've found is sparse. Is there a method I can override which considers reading on a file-by-file basis? Our Omniture logs have a date stamp in the filename, and it would be more maintainable to reject a file based on its basename rather than its path. We're more likely to change our paths than change the filenames, so this would mean the code has a better chance of standing the test of time.

Cheers,
Ian.
Re: Restricting loading of log files based on parameter input
Cheolsoo Park 2013-02-20, 19:39
Hi Ian,
Sorry for the late reply.
>> Is there a method I can override which considers reading on a file-by-file basis? Our Omniture logs have a date stamp in the filename, and it would be more maintainable to reject a file based on its basename rather than its path. We're more likely to change our paths than change the filenames, so this would mean the code has a better chance of standing the test of time.
The location parameter in setLocation(String location, Job job) is just a path glob, so you can replace it with a filename-based pattern. For example, if you have the following in your Pig script,

    A = LOAD '/foo/replace_me_with_regex' USING MyLoadFunc('2013', '1', '28');

you can do something like this:

    @Override
    public void setLocation(String location, Job job) throws IOException {
        String regex = log_path_regex(year, month, day);
        // String.replace returns a new string, so keep the result.
        location = location.replace("replace_me_with_regex", regex);
        FileInputFormat.setInputPaths(job, location);
    }

    // This is a Java version of your function that returns a filename pattern.
    private String log_path_regex(String y, String m, String d) {
        // compute regex
    }
Thanks, Cheolsoo
RE: Restricting loading of log files based on parameter input
Stevens, Ian 2013-02-20, 20:05
Thanks, Cheolsoo. That's not exactly the answer I was looking for; I'm aware how an implementation of setLocation() could work. I was just looking for an alternate method to override, but I suspect there isn't one. I can work with the regex.

BTW, if you're on StackOverflow and want to post your answer to my question there in order to claim points, you can do so at http://stackoverflow.com/questions/14885333/restricting-loading-of-log-files-in-pig-latin-based-on-interested-date-range-as

Cheers,
Ian.
RE: Restricting loading of log files based on parameter input
Stevens, Ian 2013-02-15, 21:59
I forgot to mention that the date stamp also exists in the filename of the log in addition to the path. Is a custom LoadFunc the answer? With that, I'd presumably have to specify /year=*/month=*/day=* and force Pig to test every file name for a date stamp which falls between two dates. That seems like a huge hack and a waste of resources.

Ian.
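
The per-file test described above, checking whether a date stamp in the basename falls between two dates, can be sketched as follows. The filename layout is not given in the thread, so the YYYY-MM-DD stamp, the example filename and the `in_range` helper are all assumptions; inside Pig the equivalent check would have to live in the Java loader.

```python
import re
from datetime import date, datetime

# Hypothetical basename layout: any name containing a YYYY-MM-DD stamp.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})")


def in_range(path, start, end):
    """Return True if the date stamp in the basename lies in [start, end]."""
    # Only the basename is inspected, so the directory layout can change freely.
    basename = path.rsplit("/", 1)[-1]
    match = STAMP.search(basename)
    if match is None:
        return False
    try:
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d").date()
    except ValueError:
        # Digits that look like a stamp but are not a real calendar date.
        return False
    return start <= stamp <= end


if __name__ == "__main__":
    path = "s3://foo/bar/year=2013/month=02/day=01/hits_2013-02-01.tsv.gz"
    print(in_range(path, date(2013, 1, 28), date(2013, 2, 3)))
```

As Ian notes, applying such a test means every file under the broad glob still has to be listed first, which is the cost he is objecting to.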