Pig >> mail # dev >> Re: Exception Handling in Pig Scripts


Re: Exception Handling in Pig Scripts
Some more thoughts.

* Looking at the existing keywords:
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
It seems ONERROR would be better than ON_ERROR for consistency. There is an existing ONSCHEMA, but no keyword containing an underscore.

* The default behavior should be to die on error and can be overridden as follows:
DEFAULT ONERROR <error handler>;

* Built in error handlers:
Ignore() => ignores errors by dropping records that cause exceptions
Fail() => fails the script on error. (default)
FailOnThreshold(threshold) => fails if the ratio of errors to input records is above the threshold
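
To make the Ignore() and Fail() semantics concrete, here is a minimal sketch in plain Java. It does not use the Pig API: SimpleErrorHandler, BuiltInHandlers and forEach are hypothetical names invented for illustration, and the real handlers would work on Tuples inside the FOREACH operator rather than on a List.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Simplified stand-in for the proposed handler (hypothetical, not the Pig API). */
interface SimpleErrorHandler<I, O> {
    /** Returns a replacement value, or null to drop the record; may rethrow to fail. */
    O handle(RuntimeException error, I input);
}

final class BuiltInHandlers {
    private BuiltInHandlers() {}

    /** Ignore(): drop the record that caused the exception. */
    static <I, O> SimpleErrorHandler<I, O> ignore() {
        return (error, input) -> null;
    }

    /** Fail(): rethrow so the whole run stops (the proposed default). */
    static <I, O> SimpleErrorHandler<I, O> fail() {
        return (error, input) -> { throw error; };
    }

    /**
     * Applies udf to each record and routes exceptions through the handler.
     * Assumption of this sketch: a null result means "drop the record".
     */
    static <I, O> List<O> forEach(List<I> records, Function<I, O> udf,
                                  SimpleErrorHandler<I, O> handler) {
        List<O> out = new ArrayList<>();
        for (I record : records) {
            O value;
            try {
                value = udf.apply(record);
            } catch (RuntimeException e) {
                value = handler.handle(e, record);
            }
            if (value != null) {
                out.add(value);
            }
        }
        return out;
    }
}
```

With ignore(), parsing the records "1", "x", "3" as integers would keep 1 and 3 and silently drop "x"; with fail(), the NumberFormatException propagates and stops the loop.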

* The error handler interface needs a method that is called on the client side after the relation is computed, to decide what to do next.
Typically FailOnThreshold would throw an exception there if (#errors/#input) > threshold, using counters.
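import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public interface ErrorHandler<T> {

    // input is not the input of the UDF, it's the tuple from the relation
    T handle(IOException ioe, EvalFunc<?> evalFunc, Tuple input) throws IOException;

    // same role as EvalFunc.outputSchema, for the handler's output
    Schema outputSchema(Schema input);

    // called afterwards on the client side
    void collectResult() throws IOException;

}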
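
As a rough illustration of how FailOnThreshold could use collectResult(), here is a self-contained sketch. It is an assumption-laden stand-in: FailOnThresholdSketch, recordInput and errorRatio are hypothetical names, in-memory AtomicLong counters replace the Hadoop counters, and Object replaces Tuple so it compiles without Pig on the classpath.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for FailOnThreshold; plain counters replace Hadoop counters. */
final class FailOnThresholdSketch {
    private final double threshold;
    private final AtomicLong errors = new AtomicLong();
    private final AtomicLong inputs = new AtomicLong();

    FailOnThresholdSketch(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        this.threshold = threshold;
    }

    /** Task side: call once per input record of the relation. */
    void recordInput() {
        inputs.incrementAndGet();
    }

    /** Task side: count the error and drop the record by returning null. */
    Object handle(Exception error, Object input) {
        errors.incrementAndGet();
        return null;
    }

    /** Ratio of failed records to input records; 0 when nothing was read. */
    double errorRatio() {
        long total = inputs.get();
        return total == 0 ? 0.0 : (double) errors.get() / total;
    }

    /** Client side, after the relation is computed: fail if the ratio is too high. */
    void collectResult() throws IOException {
        double ratio = errorRatio();
        if (ratio > threshold) {
            throw new IOException("error ratio " + ratio + " exceeds threshold " + threshold);
        }
    }
}
```

The point of the split is that handle() only counts on the task side, while the decision to fail is taken once, on the client, when the totals are known.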

* SPLIT is optional

example:
DEFAULT ONERROR Ignore();
...

DESCRIBE A;
A: {name: chararray, age: int, gpa: float}

-- fail if more than 1% errors
B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR FailOnThreshold(0.01) ;

-- need to make sure the twitter infrastructure can handle the load
C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;

-- custom handler that counts errors and logs on the client side
D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;

-- uses default handler and SPLIT
B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO B2_ERRORS;

-- B2_ERRORS cannot really contain the input to the UDF, as it would have a different schema depending on which UDF failed
DESCRIBE B2_ERRORS;
B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray, error:(class: chararray, message: chararray, stacktrace: chararray) }

-- example of filtering on the udf
C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO C2_FOO_ERRORS IF udf == 'Foo', C2_BAR_ERRORS IF udf == 'Bar';
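
For reference, the error record described above (input, udf, error: class, message, stacktrace) and the split on the udf field could be modeled roughly as follows. This is only a sketch in plain Java: ErrorRecord and ErrorSplitter are hypothetical names, and Object stands in for the input tuple.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one row of the error relation. */
final class ErrorRecord {
    final Object input;       // the tuple from the relation, not the UDF arguments
    final String udf;         // name of the UDF that failed
    final String errorClass;  // error.class
    final String message;     // error.message
    final String stacktrace;  // error.stacktrace

    private ErrorRecord(Object input, String udf, String errorClass,
                        String message, String stacktrace) {
        this.input = input;
        this.udf = udf;
        this.errorClass = errorClass;
        this.message = message;
        this.stacktrace = stacktrace;
    }

    /** Builds a record from the failing input, the UDF name and the exception. */
    static ErrorRecord of(Object input, String udf, Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return new ErrorRecord(input, udf, t.getClass().getName(),
                               t.getMessage(), sw.toString());
    }
}

final class ErrorSplitter {
    private ErrorSplitter() {}

    /** Groups error records by UDF name, like SPLIT ... IF udf == '...'. */
    static Map<String, List<ErrorRecord>> splitByUdf(List<ErrorRecord> errors) {
        Map<String, List<ErrorRecord>> out = new LinkedHashMap<>();
        for (ErrorRecord e : errors) {
            out.computeIfAbsent(e.udf, k -> new ArrayList<>()).add(e);
        }
        return out;
    }
}
```

Because the record carries the whole input tuple plus the UDF name, its schema stays the same whichever UDF failed, which is what makes the filter on udf possible.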

Julien

On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

We should think more about the interface.
For example, "Tuple input" argument -- is that the tuple that was passed to
the udf, or the whole tuple that was being processed? I can see wanting
both.
Also the Handler should probably have init and finish methods in case some
accumulation is happening, or state needs to get set up...

not sure about "splitting" into a table. Maybe more like

A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into] A_ERRORS;

"use" and "into" are optional syntactic sugar.

This allows us to do any combination of:
- die
- put original record into a table
- process the error using a custom handler (which can increment counters,
write to dbs, send tweets... definitely send tweets...)

D

On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <[EMAIL PROTECTED]> wrote:

> That would be nice.
> Also letting the error handler output the result to a relation would be
> useful.
> (To let the script output application error metrics)
> For example it could (optionally) use the keyword INTO just like the SPLIT
> operator.
>
> FOO = LOAD ...;
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
>
> ErrorHandler would look a little more like EvalFunc:
>
> public interface ErrorHandler<T> {
>
>  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
>
> public Schema outputSchema(Schema input);
>
> }
>
> There could be a built-in handler to output the skipped record (input:
> tuple, funcname:chararray, errorMessage:chararray)
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
>
> Julien
>
> On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>
> I was thinking about this..
>
> We add an optional ON_ERROR clause to operators, which allows a user to