RE: Restricting loading of log files based on parameter input
Thanks for this, Cheolsoo. I've started work on a LoadFunc on the assumption that it would be an easy win, although I hate that I have to do this. It's just text concatenation, after all. Moving the logic of our log path structure into a Java class or an external package is wrong from a maintenance standpoint.

How familiar are you (or anyone) with creating a custom LoadFunc? The documentation I've found is sparse. Is there a method I can override that decides, file by file, whether a file should be read? Our Omniture logs have a date stamp in the filename, and it would be more maintainable to reject a file based on its basename rather than its path. We're more likely to change our paths than our filenames, so the code would have a better chance of standing the test of time.
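
To make the idea concrete, here is a rough sketch of the basename check I have in mind, in plain Java with no Pig dependencies. The class name, the accept method, and the YYYY-MM-DD stamp format are all placeholders made up for illustration (the stamp format is a stand-in, not our real naming scheme), and where such a check would be hooked into a LoadFunc is exactly the part I don't know.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasenameDateFilter {
    // Hypothetical stamp format: YYYY-MM-DD somewhere in the basename.
    private static final Pattern DATE_STAMP =
            Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    private final LocalDate first; // inclusive
    private final LocalDate last;  // inclusive

    public BasenameDateFilter(LocalDate first, LocalDate last) {
        this.first = first;
        this.last = last;
    }

    /** Returns true if the basename carries a date stamp within [first, last]. */
    public boolean accept(String path) {
        // Look only at the part after the last '/', so directory names are ignored.
        String basename = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = DATE_STAMP.matcher(basename);
        if (!m.find()) {
            return false; // no date stamp in the basename: reject
        }
        LocalDate stamp;
        try {
            stamp = LocalDate.of(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)));
        } catch (DateTimeException e) {
            return false; // digits matched but do not form a valid date
        }
        return !stamp.isBefore(first) && !stamp.isAfter(last);
    }
}
```

Because only the basename is inspected, moving the logs to a different directory layout would not change which files are accepted.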


-----Original Message-----
From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
Sent: February-15-13 4:53 PM
Subject: Re: Restricting loading of log files based on parameter input

Hi Ian,

1) Pre-processor statements are just text substitution, so you can't call a Python (or Java) function inside %declare.

2) Regarding DEFINE statements, there are two problems with using them for scripting UDFs:
- You can't pass constructor parameters to a scripting UDF.
- You can't use a scripting UDF as a Load/StoreFunc.

Given these constraints, I think writing a Java LoadFunc is the best option. I would write a subclass of OmnitureTextLoader that takes constructor parameters. For example,

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;

public class MyOmnitureTextLoader extends OmnitureTextLoader {

  private String year;
  private String month;

  public MyOmnitureTextLoader() { ... }
  public MyOmnitureTextLoader(String year, String month) { ... }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // Compute week path with year and month and replace location with that.
  }
}

Then you can do something like this in Pig:

DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');

A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;
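
The "compute week path" step could be sketched in plain Java roughly as follows. This is only an illustration under some assumptions: the class and method names are hypothetical, a week is taken to mean seven consecutive days from a start date, and a start day is passed in addition to the year and month shown in the constructor example. The directory layout is the one from your original message.

```java
import java.time.LocalDate;
import java.util.Locale;
import java.util.StringJoiner;

public class WeekPathBuilder {

    /** Builds a brace glob covering seven consecutive days from the start date. */
    public static String weekGlob(int year, int month, int day) {
        LocalDate start = LocalDate.of(year, month, day);
        // Join the seven day paths with commas and wrap them in braces.
        StringJoiner glob = new StringJoiner(",", "{", "}");
        for (int offset = 0; offset < 7; offset++) {
            // plusDays rolls over month and year boundaries for us.
            LocalDate d = start.plusDays(offset);
            glob.add(String.format(Locale.ROOT,
                    "year=%04d/month=%02d/day=%02d",
                    d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
        }
        return glob.toString();
    }
}
```

For the start date 2013-01-28 this should give the same glob as the log_path_regex example in your original message. The result would then be substituted into location before handing it to the parent loader's setLocation.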

Hope this is helpful.

On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian

> Hi everyone. I'm having a problem loading log files based on parameter
> input and was wondering whether someone would be able to provide some
> guidance. The logs in question are Omniture logs, stored in
> subdirectories based on year, month, and day (e.g.
> /year=2013/month=02/day=14). For any day, multiple logs could exist, each hundreds of MB.
> I have a Pig script which currently processes logs for an entire
> month, with the month and the year specified as script parameters (e.g.
> /year=$year/month=$month/day=*). It works fine and we're quite happy
> with it. That said, we want to switch to weekly processing of logs,
> which means the previous LOAD path glob won't work (weeks can wrap
> months as well as years). To solve this, I have a Python UDF which
> takes a start date and spits out the necessary glob for a week's worth of logs, e.g.:
>     >>> log_path_regex(2013, 1, 28)
>     '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
> This glob will then be inserted in the appropriate path:
>     %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
>     data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); // See http://github.com/msukmanowsky/OmnitureTextLoader
> Unfortunately, I can't for the life of me figure out how to populate
> $week_path based on $year, $month and $day script parameters. I tried
> using %declare, but Grunt complains and says it's logging, but never does:
>     %declare week_path util.log_path_regex(year, month, day);
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig
> version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging