
Pig, mail # dev - Re: Exception Handling in Pig Scripts

Julien Le Dem 2011-01-20, 19:31
Doesn't Hadoop discard the increments to counters done by failed tasks? (I would expect so, but I don't know.)
Also, if we use counters we should make sure we don't mix up multiple relations being combined by the optimizer.

P.S.: Regarding rror, I don't see why you would want two of these:
On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

I think this is coming together! I like the idea of a client-side handler
method that allows us to look at all errors in aggregate and make
decisions based on proportions. How can we guard against catching the wrong
mistakes -- say, letting a mapper that is running on a bad node and failing all
its local disk writes finish "successfully", when really the task just
needs to be rerun on a different node and normally MR would take care
of it?
Let's put this on a wiki for wider feedback.

P.S. What's a "rror" and why do we only want one of them?
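One way to address the bad-node concern above is to route only the UDF's own exception to the error handler, so failures elsewhere in the task attempt (such as local disk writes) still propagate and trigger the normal MapReduce retry. A minimal sketch under that assumption; `GuardedUdfCall` and all names here are illustrative, not part of the proposal:

```java
import java.util.function.Function;

// Illustrative sketch: only an exception thrown inside the UDF call itself is
// routed to the error handler; I/O failures elsewhere in the task attempt are
// thrown outside this call, so the attempt still fails and MR reruns it on
// another node.
public class GuardedUdfCall {
    public static <I, O> O apply(Function<I, O> udf, I input,
                                 Function<RuntimeException, O> onError) {
        try {
            return udf.apply(input);
        } catch (RuntimeException e) {
            // e.g. drop the record and count the error
            return onError.apply(e);
        }
    }
}
```

With this shape, an Ignore()-style handler simply returns a sentinel for the failed record, while infrastructure errors never reach it.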

On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <[EMAIL PROTECTED]> wrote:

> Some more thoughts.
> * Looking at the existing keywords:
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> It seems ONERROR would be better than ON_ERROR for consistency. There is an
> existing ONSCHEMA but no _ based keyword.
> * The default behavior should be to die on error and can be overridden as
> follows:
> DEFAULT ONERROR <error handler>;
> * Built in error handlers:
> Ignore() => ignores errors by dropping records that cause exceptions
> Fail() => fails the script on error. (default)
> FailOnThreshold(threshold) => fails if the ratio of errors is above the threshold
> * The error handler interface needs a method called on client side after
> the relation is computed to decide what to do next.
> Typically FailOnThreshold will throw an exception if
> (#errors/#input)>threshold using counters.
> public interface ErrorHandler<T> {
> // input is not the input of the UDF, it's the whole tuple from the relation
> T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
>  IOException;
> Schema outputSchema(Schema input);
> // called afterwards on the client side
> void collectResult() throws IOException;
> }
> * SPLIT is optional
> example:
> ...
> A: {name: chararray, age: int, gpa: float}
> -- fail if more than 1% errors
> B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
> FailOnThreshold(0.01) ;
> -- need to make sure the twitter infrastructure can handle the load
> C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
> -- custom handler that counts errors and logs on the client side
> D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;
> -- uses default handler and SPLIT
> -- B2_ERRORS can not really contain the input to the UDF as it would have a
> different schema depending on what UDF failed
> B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray,
> error:(class: chararray, message: chararray, stacktrace: chararray) }
> -- example of filtering on the udf
> C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
> Julien
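As a rough illustration of the FailOnThreshold semantics Julien describes (throw if #errors/#input > threshold), here is a standalone sketch. Plain long fields stand in for the Hadoop counters a real handler would read on the client side in collectResult(); the class and method names are illustrative only:

```java
import java.io.IOException;

// Standalone sketch of the proposed FailOnThreshold error handler.
// A real implementation would increment Hadoop counters in the tasks and
// read them back on the client side in collectResult().
public class FailOnThreshold {
    private final double threshold;
    private long errors = 0;
    private long inputs = 0;

    public FailOnThreshold(double threshold) {
        this.threshold = threshold;
    }

    // called for every input tuple of the relation
    public void recordInput() { inputs++; }

    // called when a UDF throws; the failed record is dropped
    public void recordError(IOException ioe) { errors++; }

    // called afterwards on the client side, as in the proposed interface
    public void collectResult() throws IOException {
        if (inputs > 0 && (double) errors / inputs > threshold) {
            throw new IOException("error ratio " + ((double) errors / inputs)
                + " exceeds threshold " + threshold);
        }
    }
}
```

For example, with threshold 0.01, two errors out of 100 inputs (2%) would fail the script, while the same two errors under threshold 0.05 would not.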
> On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
> We should think more about the interface.
> For example, "Tuple input" argument -- is that the tuple that was passed to
> the udf, or the whole tuple that was being processed? I can see wanting
> both.
> Also the Handler should probably have init and finish methods in case some
> accumulation is happening, or state needs to get set up...
> not sure about "splitting" into a table. Maybe more like
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
> "use" and "into" are optional syntactic sugar.
> This allows us to do any combination of:
> - die
> - put original record into a table