Pig, mail # user - Restricting loading of log files based on parameter input

RE: Restricting loading of log files based on parameter input
Stevens, Ian 2013-02-20, 20:05
Thanks Cheolsoo. That's not exactly the answer I was looking for; I'm aware of how an implementation of setLocation() could work. I was just looking for an alternative method to override, but I suspect there isn't one. I can work with the regex.

BTW, if you're on StackOverflow and want to post your answer to my question there in order to claim points, you can do so at http://stackoverflow.com/questions/14885333/restricting-loading-of-log-files-in-pig-latin-based-on-interested-date-range-as

Cheers,
Ian.

-----Original Message-----
From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
Sent: February-20-13 2:39 PM
To: [EMAIL PROTECTED]
Subject: Re: Restricting loading of log files based on parameter input

Hi Ian,

Sorry for the late reply.

>> Is there a method I can override which considers reading on a file-by-file basis? Our Omniture logs have a date stamp in the filename, and it would be more maintainable to reject a file based on its basename rather than its path. We're more likely to change our paths than change the filenames, so this would mean the code has a better chance of standing the test of time.

The location parameter in setLocation(String location, Job job) is just a path glob, so you can replace it with a filename-based pattern. For example, if you have the following in your Pig script,

A = LOAD '/foo/replace_me_with_regex' USING MyLoadFunc('2013', '1', '28');

You can do something like this:

@Override
public void setLocation(String location, Job job) throws IOException {
   // Build the filename pattern from the constructor arguments.
   String regex = log_path_regex(year, month, day);
   // String.replace() returns a new string, so keep the result.
   location = location.replace("replace_me_with_regex", regex);
   FileInputFormat.setInputPaths(job, location);
}

// This is a Java version of your function that returns a filename pattern.
private String log_path_regex(String y, String m, String d) {
   // compute regex
}
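
As a rough sketch of the "compute regex" step, here is the substitution in plain Java with no Pig or Hadoop dependencies. It assumes the date stamp appears in the basename as yyyy-MM-dd, which is only a guess since the real Omniture filename layout isn't given in this thread. The class and method names are made up for illustration, and the pattern is a glob rather than a true regex because the location string is a path glob.

```java
import java.util.Locale;

final class LogLocation {
    // Placeholder token; it must match the text used in the LOAD statement.
    static final String PLACEHOLDER = "replace_me_with_regex";

    // Assumed filename layout: the date appears as yyyy-MM-dd somewhere in the basename.
    static String filenameGlob(String year, String month, String day) {
        return String.format(Locale.ROOT, "*%s-%02d-%02d*",
                year, Integer.parseInt(month), Integer.parseInt(day));
    }

    // Strings are immutable, so the result of replace() is returned rather than discarded.
    static String resolve(String location, String year, String month, String day) {
        return location.replace(PLACEHOLDER, filenameGlob(year, month, day));
    }

    public static void main(String[] args) {
        // prints "/foo/*2013-01-28*"
        System.out.println(resolve("/foo/" + PLACEHOLDER, "2013", "1", "28"));
    }
}
```

Inside setLocation() the resolved string would then be handed to FileInputFormat.setInputPaths(job, ...). Keeping the pattern logic in a small static helper like this also makes it testable without a cluster.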

Thanks,
Cheolsoo

On Tue, Feb 19, 2013 at 1:55 PM, Stevens, Ian <[EMAIL PROTECTED]> wrote:

> Thanks for this Cheolsoo. I started work on a LoadFunc assuming it was
> an easy win, although I hate that I have to do this. It's just text
> concatenation after all. Moving the logic of our log path structure to
> a Java class or an external package is wrong from a maintenance standpoint.
>
> How familiar are you (or anyone) with creating a custom LoadFunc? The
> documentation I've found is sparse. Is there a method I can override
> which considers reading on a file-by-file basis? Our Omniture logs
> have a date stamp in the filename, and it would be more maintainable
> to reject a file based on its basename rather than its path. We're
> more likely to change our paths than change the filenames, so this
> would mean the code has a better chance of standing the test of time.
>
> Cheers,
> Ian.
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
> Sent: February-15-13 4:53 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Restricting loading of log files based on parameter input
>
> Hi Ian,
>
> 1) Pre-processor statements are just text substitution, so you can't
> call a Python (or Java) function inside %declare.
>
> 2) Regarding DEFINE statements, there are two problems using them with
> scripting UDF:
> - You can't pass constructor parameters to scripting UDF.
> - You can't use scripting UDF for Load/StoreFunc.
>
> Given these constraints, I think writing a Java LoadFunc seems to be
> the best option. I would write a sub-class of OmnitureTextLoader in
> such a way that it can take constructor parameters. For example,
>
> class MyOmnitureTextLoader extends OmnitureTextLoader {
>
>   private String year;
>   private String month;
>
>   public MyOmnitureTextLoader() { ... }
>   public MyOmnitureTextLoader(String year, String month) { ... }
>
>   @Override
>   public void setLocation(String location, Job job) throws IOException {
>     // Compute week path with year and month and replace location with that.
>   }
> }
>
> Then, you can do something like this in Pig:
>
> DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');
>
> A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;