Pig, mail # user - Restricting loading of log files based on parameter input


Re: Restricting loading of log files based on parameter input
Cheolsoo Park 2013-02-20, 19:39
Hi Ian,

Sorry for the late reply.

> Is there a method I can override which considers reading on a
> file-by-file basis? Our Omniture logs have a date stamp in the filename,
> and it would be more maintainable to reject a file based on its basename
> rather than its path. We're more likely to change our paths than change the
> filenames, so this would mean the code has a better chance of standing the
> test of time.

The location parameter in setLocation(String location, Job job) is just a
path glob, so you can replace it with a filename-based pattern. For
example, if you have the following in your Pig script:

A = LOAD '/foo/replace_me_with_regex' USING MyLoadFunc('2013', '1', '28');

You can do something like this:

@Override
public void setLocation(String location, Job job) throws IOException {
   // year, month and day are fields set from the constructor arguments.
   String regex = log_path_regex(year, month, day);
   // Strings are immutable, so keep the value returned by replace().
   location = location.replace("replace_me_with_regex", regex);
   FileInputFormat.setInputPaths(job, location);
}

// This is a Java version of your function that returns a filename pattern.
private String log_path_regex(String y, String m, String d) {
   // compute regex
}
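
As a rough sketch of what such a helper could look like: the class and
method names below (LogPathPattern, filenameGlob, resolveLocation) are
made up for illustration, and the yyyy-MM-dd date stamp is only an
assumption about how the Omniture filenames are laid out. It uses plain
string handling, so it can be tried without Pig or Hadoop on the classpath.

```java
import java.util.Locale;

/** Hypothetical helper; the date-stamp layout is an assumption, not the real Omniture naming. */
final class LogPathPattern {

    // Placeholder used in the Pig LOAD statement above.
    static final String PLACEHOLDER = "replace_me_with_regex";

    private LogPathPattern() {}

    /**
     * Builds a filename glob for one day, assuming the filename
     * contains the date as yyyy-MM-dd (assumption).
     */
    static String filenameGlob(String year, String month, String day) {
        int y = Integer.parseInt(year.trim());
        int m = Integer.parseInt(month.trim());
        int d = Integer.parseInt(day.trim());
        if (m < 1 || m > 12 || d < 1 || d > 31) {
            throw new IllegalArgumentException(
                "invalid date: " + year + "-" + month + "-" + day);
        }
        // Zero-pad month and day so '1' and '01' produce the same pattern.
        return String.format(Locale.ROOT, "*%04d-%02d-%02d*", y, m, d);
    }

    /** Returns a new location string with the placeholder swapped for the glob. */
    static String resolveLocation(String location, String year, String month, String day) {
        if (!location.contains(PLACEHOLDER)) {
            throw new IllegalArgumentException("location has no placeholder: " + location);
        }
        // String.replace does a literal (non-regex) substitution.
        return location.replace(PLACEHOLDER, filenameGlob(year, month, day));
    }
}
```

Inside setLocation you would pass the result of resolveLocation(location,
year, month, day) to FileInputFormat.setInputPaths. Keep in mind that a
glob '*' only matches within a single path component, so any directory
levels above the filename still need their own components in the LOAD path.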

Thanks,
Cheolsoo

On Tue, Feb 19, 2013 at 1:55 PM, Stevens, Ian
<[EMAIL PROTECTED]> wrote:

> Thanks for this Cheolsoo. I started work on a LoadFunc assuming it was an
> easy win, although I hate that I have to do this. It's just text
> concatenation after all. Moving the logic of our log path structure to a
> Java class or an external package is wrong from a maintenance standpoint.
>
> How familiar are you (or anyone) with creating a custom LoadFunc? The
> documentation I've found is sparse. Is there a method I can override which
> considers reading on a file-by-file basis? Our Omniture logs have a date
> stamp in the filename, and it would be more maintainable to reject a file
> based on its basename rather than its path. We're more likely to change our
> paths than change the filenames, so this would mean the code has a better
> chance of standing the test of time.
>
> Cheers,
> Ian.
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
> Sent: February-15-13 4:53 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Restricting loading of log files based on parameter input
>
> Hi Ian,
>
> 1) Pre-processor statements are just text substitution, so you can't call
> a Python (or Java) function inside %declare.
>
> 2) Regarding DEFINE statements, there are two problems with using them
> for scripting UDFs:
> - You can't pass constructor parameters to a scripting UDF.
> - You can't use a scripting UDF as a Load/StoreFunc.
>
> Given these constraints, I think writing a Java LoadFunc seems to be the
> best option. I would write a sub-class of OmnitureTextLoader in such a way
> that it can take constructor parameters. For example,
>
> class MyOmnitureTextLoader extends OmnitureTextLoader {
>
>   private String year;
>   private String month;
>
>   public MyOmnitureTextLoader() { ... }
>   public MyOmnitureTextLoader(String year, String month) { ... }
>
>   @Override
>   public void setLocation(String location, Job job) throws IOException {
>     // Compute week path with year and month and replace location with that.
>   }
> }
>
> Then you can do something like this in Pig:
>
> DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');
>
> A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;
>
> Hope this is helpful.
>
> Thanks,
> Cheolsoo
>
>
>
>
> On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian
> <[EMAIL PROTECTED]> wrote:
>
> > Hi everyone. I'm having a problem loading log files based on parameter
> > input and was wondering whether someone would be able to provide some
> > guidance. The logs in question are Omniture logs, stored in
> > subdirectories based on year, month, and day (eg.
> > /year=2013/month=02/day=14). For any day, multiple logs could exist,
> > each hundreds of MB.
> >
> > I have a Pig script which currently processes logs for an entire
> > month, with the month and the year specified as script parameters (eg.
> > /year=$year/month=$month/day=*). It works fine and we're quite happy