Pig >> mail # dev >> Re: Exception Handling in Pig Scripts


Re: Exception Handling in Pig Scripts
Doesn't Hadoop discard the increments to counters done by failed tasks? (I would expect that, but I don't know)
Also, if we use counters, we should make sure we don't mix up the counts of multiple relations that the optimizer has combined.
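
One way to keep the tallies apart is to key every count by the relation alias. The sketch below is only an illustration: it uses an in-memory map as a hypothetical stand-in for job counters, and the class and method names are made up for this example, not part of Pig or Hadoop.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for job counters, kept separately per relation alias. */
final class PerRelationErrorCounts {
    // alias -> {errors, inputs}
    private final Map<String, long[]> totals = new HashMap<>();

    private long[] slot(String alias) {
        return totals.computeIfAbsent(alias, k -> new long[2]);
    }

    /** Call once for every record fed to the relation. */
    void recordInput(String alias) {
        slot(alias)[1]++;
    }

    /** Call once for every record whose UDF call threw. */
    void recordError(String alias) {
        slot(alias)[0]++;
    }

    long errors(String alias) {
        long[] t = totals.get(alias);
        return t == null ? 0 : t[0];
    }

    long inputs(String alias) {
        long[] t = totals.get(alias);
        return t == null ? 0 : t[1];
    }
}
```

Even if B1 and C1 ended up in the same job, errors recorded under "B1" would then never count against a threshold set on C1.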

P.S.: Regarding rror, I don't see why you would want two of these:
http://lmyit.com/rror
:P
On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

I think this is coming together! I like the idea of a client-side handler
method that allows us to look at all errors in aggregate and make
decisions based on proportions. How can we guard against catching the wrong
mistakes -- say, letting a mapper that's running on a bad node and fails all
local disk writes finish "successfully" even though properly, the task just
needs to be rerun on a different mapper and normally MR would just take care
of it?
Let's put this on a wiki for wider feedback.

P.S. What's a "rror" and why do we only want one of them?

On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <[EMAIL PROTECTED]> wrote:

> Some more thoughts.
>
> * Looking at the existing keywords:
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> It seems ONERROR would be better than ON_ERROR for consistency. There is an
> existing ONSCHEMA but no _ based keyword.
>
> * The default behavior should be to die on error and can be overridden as
> follows:
> DEFAULT ONERROR <error handler>;
>
> * Built in error handlers:
> Ignore() => ignores errors by dropping records that cause exceptions
> Fail() => fails the script on error. (default)
> FailOnThreshold(threshold) => fails if the proportion of errors is above threshold
>
> * The error handler interface needs a method called on client side after
> the relation is computed to decide what to do next.
> Typically FailOnThreshold will throw an exception if
> (#errors/#input)>threshold using counters.
> public interface ErrorHandler<T> {
>
> // input is not the input of the UDF, it's the tuple from the relation
> T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
>
> Schema outputSchema(Schema input);
>
> // called afterwards on the client side
> void collectResult() throws IOException;
>
> }
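
A minimal sketch of the client-side check that FailOnThreshold would do in collectResult() could look like the following. It is deliberately standalone: it does not implement the interface above (that would need the Pig classes), the class name is hypothetical, and the two totals are assumed to have been read from counters by the caller. Treating an empty input as a ratio of 0 is also an assumption.

```java
import java.io.IOException;

/** Hypothetical stand-in for the client-side half of a FailOnThreshold handler. */
final class FailOnThresholdSketch {
    private final double threshold;

    FailOnThresholdSketch(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]: " + threshold);
        }
        this.threshold = threshold;
    }

    /** #errors / #input; assumed to be 0 when there was no input at all. */
    static double errorRatio(long errors, long inputs) {
        return inputs == 0 ? 0.0 : (double) errors / inputs;
    }

    /**
     * Runs on the client after the relation is computed. Throws if
     * (#errors / #input) > threshold, otherwise returns normally.
     */
    void collectResult(long errors, long inputs) throws IOException {
        double ratio = errorRatio(errors, inputs);
        if (ratio > threshold) {
            throw new IOException(String.format(
                "%d of %d records failed (ratio %.4f > threshold %.4f)",
                errors, inputs, ratio, threshold));
        }
    }
}
```

With FailOnThreshold(0.01), 5 errors out of 1000 inputs would pass and 20 out of 1000 would fail the script.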
>
> * SPLIT is optional
>
> example:
> DEFAULT ONERROR Ignore();
> ...
>
> DESCRIBE A;
> A: {name: chararray, age: int, gpa: float}
>
> -- fail if more than 1% errors
> B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
> FailOnThreshold(0.01) ;
>
> -- need to make sure the twitter infrastructure can handle the load
> C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
>
> -- custom handler that counts errors and logs on the client side
> D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;
>
> -- uses default handler and SPLIT
> B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> B2_ERRORS;
>
> -- B2_ERRORS can not really contain the input to the UDF as it would have a
> different schema depending on what UDF failed
> DESCRIBE B2_ERRORS;
> B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray,
> error:(class: chararray, message: chararray, stacktrace: chararray) }
>
> -- example of filtering on the udf
> C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> C2_FOO_ERRORS IF udf == 'Foo', C2_BAR_ERRORS IF udf == 'Bar';
>
> Julien
>
> On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>
> We should think more about the interface.
> For example, "Tuple input" argument -- is that the tuple that was passed to
> the udf, or the whole tuple that was being processed? I can see wanting
> both.
> Also the Handler should probably have init and finish methods in case some
> accumulation is happening, or state needs to get set up...
>
> not sure about "splitting" into a table. Maybe more like
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
> A_ERRORS;
>
> "use" and "into" are optional syntactic sugar.
>
> This allows us to do any combination of:
> - die
> - put original record into a table