Pig >> mail # user >> Restricting loading of log files based on parameter input


Stevens, Ian 2013-02-14, 22:16
Re: Restricting loading of log files based on parameter input
Hi Ian,

1) Pre-processor statements are just text substitution, so you can't call a
Python (or Java) function inside %declare.

2) Regarding DEFINE statements, there are two problems with using them for
scripting UDFs:
- You can't pass constructor parameters to a scripting UDF.
- You can't use a scripting UDF as a Load/StoreFunc.

Given these constraints, I think writing a Java LoadFunc is the best
option. I would write a subclass of OmnitureTextLoader that takes
constructor parameters. For example,

class MyOmnitureTextLoader extends OmnitureTextLoader {

  private String year;
  private String month;

  public MyOmnitureTextLoader() { ... }
  public MyOmnitureTextLoader(String year, String month) { ... }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // Compute week path with year and month and replace location with that.
  }
}
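
The "compute week path" step inside setLocation could be sketched roughly as below. This is only an illustration, not code from OmnitureTextLoader or Pig: the class name WeekPathGlob and the method forWeek are hypothetical, it takes a start day in addition to year and month (as log_path_regex in the quoted message does), and it assumes Java 8 or later for java.time.

```java
import java.time.LocalDate;
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper (not part of Pig or OmnitureTextLoader).
final class WeekPathGlob {

  // Builds a brace glob covering seven consecutive days starting at the
  // given date, using the year=/month=/day= directory layout from the thread.
  static String forWeek(int year, int month, int day) {
    LocalDate start = LocalDate.of(year, month, day);
    StringJoiner glob = new StringJoiner(",", "{", "}");
    for (int i = 0; i < 7; i++) {
      // plusDays handles month and year rollover.
      LocalDate d = start.plusDays(i);
      glob.add(String.format(Locale.ROOT, "year=%04d/month=%02d/day=%02d",
          d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
    }
    return glob.toString();
  }

  public static void main(String[] args) {
    // Same start date as the log_path_regex(2013, 1, 28) example.
    System.out.println(forWeek(2013, 1, 28));
  }
}
```

Because the date arithmetic is done by LocalDate, a week that wraps a month or a year needs no special handling.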

Then you can do something like this in Pig:

DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');

A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;
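
One way the loader might swap the placeholder for the computed path is plain string replacement on the location it receives. The sketch below is an assumption about how that could look, not the actual implementation: LocationRewriter and rewrite are hypothetical names, and the s3://foo/bar prefix is borrowed from the quoted message. It uses replace rather than an equality check because the location handed to the loader may be a longer, fully qualified path.

```java
// Hypothetical helper (not part of Pig or OmnitureTextLoader).
final class LocationRewriter {

  // Placeholder token used in the LOAD statement above.
  static final String PLACEHOLDER = "replace_me_with_week_path";

  // Returns the location with the placeholder replaced by the week glob.
  // Locations that do not contain the placeholder are returned unchanged.
  static String rewrite(String location, String weekGlob) {
    return location.replace(PLACEHOLDER, weekGlob);
  }

  public static void main(String[] args) {
    String location = "s3://foo/bar/" + PLACEHOLDER + "/*.tsv.gz";
    System.out.println(rewrite(location, "{year=2013/month=01/day=28}"));
  }
}
```

In the subclass, setLocation would pass the rewritten string on to super.setLocation(...) so the parent loader configures the job with the real path.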

Hope this is helpful.

Thanks,
Cheolsoo
On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian
<[EMAIL PROTECTED]> wrote:

> Hi everyone. I'm having a problem loading log files based on parameter
> input and was wondering whether someone would be able to provide some
> guidance. The logs in question are Omniture logs, stored in subdirectories
> based on year, month, and day (eg. /year=2013/month=02/day=14). For any
> day, multiple logs could exist, each hundreds of MB.
>
> I have a Pig script which currently processes logs for an entire month,
> with the month and the year specified as script parameters (eg.
> /year=$year/month=$month/day=*). It works fine and we're quite happy with
> it. That said, we want to switch to weekly processing of logs, which means
> the previous LOAD path glob won't work (weeks can wrap months as well as
> years). To solve this, I have a Python UDF which takes a start date and
> spits out the necessary glob for a week's worth of logs, eg:
>
>                 >>> log_path_regex(2013, 1, 28)
>
> '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
>
> This glob will then be inserted in the appropriate path:
>
>                 > %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
>                 > data = LOAD '$omniture_log_path' USING OmnitureTextLoader();
>                 > -- See http://github.com/msukmanowsky/OmnitureTextLoader
>
> Unfortunately, I can't for the life of me figure out how to populate
> $week_path based on $year, $month and $day script parameters. I tried using
> %declare but grunt complains, says it's logging but never does:
>
> > %declare week_path util.log_path_regex(year, month, day);
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig
> version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
>
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /tmp/pig_1360878842643.log
> % ls  /tmp/pig_1360878842643.log
> ls: cannot access /tmp/pig_1360878842643.log: No such file or directory
>
> The same error results if I prefix the parameters with dollar signs or
> surround prefixed parameters with quotes.
>
> If I try to use define (which I believe only works for static Java
> functions), I get the following:
>
>                 > define week_path util.log_path_regex(year, month, day);
>                 2013-02-14 17:00:42,392 [main] ERROR
> org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11,
> column 37>  mismatched input 'year' expecting RIGHT_PAREN
>
> As with %declare, I get the same error if I prefix the parameters with
> dollar signs or surround prefixed parameters with quotes.
>
> I've searched around and haven't come up with a solution. I'm possibly
> searching for the wrong thing. Invoking a shell command may work, but would
> be difficult as it would complicate our script deploy and may not be