Pig, mail # dev - Re: Exception Handling in Pig Scripts


RE: Exception Handling in Pig Scripts
Olga Natkovich 2011-01-21, 21:16
Thanks, Julien. I also added a couple of questions to the wiki.

Olga

-----Original Message-----
From: Julien Le Dem [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 20, 2011 6:36 PM
To: [EMAIL PROTECTED]
Subject: Re: Exception Handling in Pig Scripts

I've summed up the thread here:
http://wiki.apache.org/pig/PigErrorHandlingInScripts
I'm sure it's biased toward its author's opinion, let me know what you think.
Julien

On 1/20/11 3:19 PM, "Olga Natkovich" <[EMAIL PROTECTED]> wrote:

Sure :)

-----Original Message-----
From: Julien Le Dem [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 20, 2011 1:49 PM
To: [EMAIL PROTECTED]
Subject: Re: Exception Handling in Pig Scripts

I see there is already a PigErrorHandling page; what about calling it PigErrorHandlingInScripts?
Julien

On 1/20/11 12:53 PM, "Olga Natkovich" <[EMAIL PROTECTED]> wrote:

Hi guys,

Could you put together a quick wiki page with your proposal? I think it would be much easier to follow than the email discussion.

Thanks,

Olga

-----Original Message-----
From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 20, 2011 11:52 AM
To: [EMAIL PROTECTED]
Subject: Re: Exception Handling in Pig Scripts

Right, what I am saying is that the tasks would not fail because we'd catch
the errors.

Thanks for the lmyit link.. learn something new every day.

On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <[EMAIL PROTECTED]> wrote:

> Doesn't Hadoop discard the increments to counters done by failed tasks? (I
> would expect that, but I don't know)
> Also using counters we should make sure we don't mix up multiple relations
> being combined by the optimizer.
>
> P.S.: Regarding rror, I don't see why you would want two of these:
> http://lmyit.com/rror
> :P
>
>
> On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>
> I think this is coming together! I like the idea of a client-side handler
> method that allows us to look at all errors in aggregate and make
> decisions based on proportions. How can we guard against catching the wrong
> mistakes -- say, letting a mapper that's running on a bad node and fails all
> local disk writes finish "successfully" even though, properly, the task just
> needs to be rerun on a different mapper and normally MR would just take care
> of it?
> Let's put this on a wiki for wider feedback.
>
> P.S. What's a "rror" and why do we only want one of them?
>
> On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <[EMAIL PROTECTED]> wrote:
>
> > Some more thoughts.
> >
> > * Looking at the existing keywords:
> > http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> > It seems ONERROR would be better than ON_ERROR for consistency. There is an
> > existing ONSCHEMA but no _ based keyword.
> >
> > * The default behavior should be to die on error and can be overridden as
> > follows:
> > DEFAULT ONERROR <error handler>;
> >
> > * Built in error handlers:
> > Ignore() => ignores errors by dropping records that cause exceptions
> > Fail() => fails the script on error. (default)
> > FailOnThreshold(threshold) => fails if the proportion of errors exceeds the threshold
> >
> > * The error handler interface needs a method called on client side after
> > the relation is computed to decide what to do next.
> > Typically FailOnThreshold will throw an exception if
> > (#errors/#input)>threshold using counters.
> > public interface ErrorHandler<T> {
> >
> > // input is not the input of the UDF, it's the tuple from the relation
> > T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
> >
> > Schema outputSchema(Schema input);
> >
> > // called afterwards on the client side
> > void collectResult() throws IOException;
> >
> > }
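
To make the proposal concrete, here is a minimal sketch of how a FailOnThreshold handler might implement the interface above. It is only an illustration under stated assumptions: Tuple, EvalFunc and Schema are empty stand-ins rather than the real Pig classes, the class name FailOnThresholdSketch and the recordInput() helper are hypothetical, and plain fields take the place of the Hadoop counters mentioned in the thread.

```java
import java.io.IOException;

// Empty stand-ins so the sketch compiles on its own; these are NOT the real
// org.apache.pig classes, just placeholders (assumption).
interface Tuple {}
interface EvalFunc {}
interface Schema {}

// The interface as proposed in the thread.
interface ErrorHandler<T> {
    // input is the tuple from the relation, not the input of the UDF
    T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

    Schema outputSchema(Schema input);

    // called afterwards on the client side
    void collectResult() throws IOException;
}

// Hypothetical handler: drops failing records while the relation is computed,
// then fails on the client side if errors / inputs exceeds the threshold.
class FailOnThresholdSketch implements ErrorHandler<Tuple> {
    private final double threshold;
    private long errors; // a real implementation would use a Hadoop counter
    private long inputs; // likewise; incremented by hand in this sketch

    FailOnThresholdSketch(double threshold) {
        this.threshold = threshold;
    }

    // Hypothetical helper: count one input record of the relation.
    void recordInput() {
        inputs++;
    }

    @Override
    public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input) {
        errors++;
        return null; // null stands for "drop this record" in this sketch
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input; // dropping records leaves the schema unchanged
    }

    @Override
    public void collectResult() throws IOException {
        if (inputs > 0 && (double) errors / inputs > threshold) {
            throw new IOException("error rate " + errors + "/" + inputs
                    + " exceeds threshold " + threshold);
        }
    }
}
```

With a threshold of 0.01, 5 errors in 1000 inputs would let collectResult() return normally, while 20 errors in 1000 inputs would make it throw. Keeping the decision in collectResult() means it is made once, over the aggregate counts, rather than per record.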
> >
> > * SPLIT is optional
> >
> > example:
> > DEFAULT ONERROR Ignore();
> > ...
> >
> > DESCRIBE A;
> > A: {name: chararray, age: int, gpa: float}
> >
> > -- fail if more than 1% errors
> > B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
> > FailOnThreshold(0.01) ;