Re: Exception Handling in Pig Scripts
That would be nice.
Also letting the error handler output the result to a relation would be useful.
(To let the script output application error metrics)
For example it could (optionally) use the keyword INTO just like the SPLIT operator.

FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {

  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

  public Schema outputSchema(Schema input);

}

There could be a built-in handler that outputs the skipped record (input:tuple, funcname:chararray, errorMessage:chararray):

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
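A minimal sketch of what such a built-in handler might look like, under loud assumptions: ON_ERROR does not exist in Pig, so nothing here is real Pig API. To keep the sketch compilable on its own, EvalFunc is reduced to a function name (String), Tuple is stood in for by List<Object>, and outputSchema is left out. The class name SkippedRecordHandler is hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the proposed ErrorHandler<T>.
// Assumption: EvalFunc is reduced to its name and Tuple to List<Object>
// so that the sketch compiles without Pig on the classpath.
interface ErrorHandler<T> {
    T handle(IOException ioe, String funcName, List<Object> input) throws IOException;
}

// Emits one error record per failed input: (input, funcname, errorMessage).
class SkippedRecordHandler implements ErrorHandler<List<Object>> {
    @Override
    public List<Object> handle(IOException ioe, String funcName, List<Object> input) {
        // The failed input is kept as a nested value, mirroring input:tuple.
        return Arrays.<Object>asList(input, funcName, ioe.getMessage());
    }
}
```

Each returned record would be what lands in A_ERRORS, while the failed input is skipped in A.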

Julien

On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

I was thinking about this..

We add an optional ON_ERROR clause to operators, which allows a user to
specify error handling. The error handler would be a udf that would
implement an interface along these lines:

public interface ErrorHandler {

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

}

I think it makes sense not to make this a static method, so that users can
keep required state and, for example, have the handler throw its own
IOException if it's been invoked too many times.
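As a rough illustration of the stateful case, here is a sketch of a handler that tolerates a fixed number of errors and then fails. It is not Pig API: the interface is a stand-in for the proposal above, with EvalFunc reduced to a function name and Tuple replaced by List<Object> so it compiles without Pig. ThresholdErrorHandler and maxErrors are hypothetical names.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical stand-in for the proposed interface.
// Assumption: EvalFunc is reduced to its name and Tuple to List<Object>.
interface ErrorHandler {
    void handle(IOException ioe, String funcName, List<Object> input) throws IOException;
}

// Swallows up to maxErrors failures, then throws its own IOException.
class ThresholdErrorHandler implements ErrorHandler {
    private final long maxErrors;
    private long errorCount = 0; // state kept across invocations

    ThresholdErrorHandler(long maxErrors) {
        this.maxErrors = maxErrors;
    }

    @Override
    public void handle(IOException ioe, String funcName, List<Object> input) throws IOException {
        errorCount++;
        if (errorCount > maxErrors) {
            // Keep the original exception as the cause for debugging.
            throw new IOException("Error handler invoked " + errorCount
                    + " times for " + funcName + " (limit " + maxErrors + ")", ioe);
        }
        // Below the limit: return normally so the bad record is skipped.
    }

    long getErrorCount() {
        return errorCount;
    }
}
```

Note that in a distributed job each task would hold its own handler instance, so a count like this is per task rather than global.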

D
On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <[EMAIL PROTECTED]> wrote:

> Thanks for the clarification Ashutosh.
>
> Implementing this in the user realm is tricky as Dmitriy states.
> Sensitivity to error thresholds will require support from the system. We can
> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
> users classify each record. The system can then track counts of each record
> type to facilitate the computation of thresholds. The last part is to allow
> users to specify thresholds and appropriate actions (interrupt, exit,
> continue, etc.). A possible mechanism to realize this is the
> ErrorHandlingUDF described by Dmitriy.
>
> Santhosh
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:[EMAIL PROTECTED]]
> Sent: Friday, January 14, 2011 7:35 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Exception Handling in Pig Scripts
>
> Santhosh,
>
> The way you are proposing, it will kill the Pig script. I think what the
> user wants is to ignore a few "bad records", process the rest, and get
> results. The problem here is how to let the user tell Pig the definition of
> a "bad record" and how to let them specify the threshold for the percentage
> of bad records at which Pig should fail the script.
>
> Ashutosh
>
> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <[EMAIL PROTECTED]>
> wrote:
> > Sorry about the late response.
> >
> > Hadoop n00b is proposing a language extension for error handling, similar
> > to the mechanisms in other well known languages like C++, Java, etc.
> >
> > For now, can't the error semantics be handled by the UDF? For exceptional
> > scenarios you could throw an ExecException with the right details. The
> > physical operator that handles the execution of UDFs traps it for you and
> > propagates the error back to the client. You can take a look at any of the
> > builtin UDFs to see how Pig handles it internally.
> >
> > Santhosh
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, January 11, 2011 10:41 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Exception Handling in Pig Scripts
> >
> > Right now error handling is controlled by the UDFs themselves, and there
> > is no way to direct it externally.
> > You can make an ErrorHandlingUDF that would take a udf spec, invoke it,
> > trap errors, and then do the specified error handling behavior.. that's a
> > bit ugly though.
> >
> > There is a problem with trapping general exceptions of course, in that if
> > they happen 0.000001% of the time you can probably just ignore them, but if
> > they happen in half your dataset, you want the job to tell you something is
> > wrong. So this stuff gets non-trivial. If anyone wants to propose a design