

Re: Exception Handling in Pig Scripts
In some cases you just don't care and want to skip a couple of bad records. For example, you're writing ad hoc scripts to extract some stats.
In other cases you have a production system based on Pig and you want to have clear metrics of the ignored data (without adding extra filtering and complexity to your algorithm).

The idea is to be able to handle both.
What about this in the case you describe?
FOREACH FOO GENERATE Bar(*) ON_ERROR SkipMaxHandler(5);

And I would throw in as well:
DEFAULT ON_ERROR SPLIT MyHandler INTO ERRORS;
(It would need to append to the relation. ERRORS = UNION ERRORS, NEW_ERRORS ?)
Julien

On 1/18/11 10:48 AM, "Koji Noguchi" <[EMAIL PROTECTED]> wrote:

If we're talking about a couple of bad records, can we directly use the skip-record feature in MapReduce?

Koji
On 1/18/11 10:27 AM, "Julien Le Dem" <[EMAIL PROTECTED]> wrote:

That would be nice.
Also letting the error handler output the result to a relation would be useful.
(To let the script output application error metrics)
For example it could (optionally) use the keyword INTO just like the SPLIT operator.

FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {

  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

  public Schema outputSchema(Schema input);

}

There could be a built-in handler to output the skipped record (input: tuple, funcname:chararray, errorMessage:chararray)

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
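To make the SPLIT ... INTO idea concrete, here is a minimal Java sketch of how a handler that returns a value could feed an errors relation. Everything in it is an assumption made for illustration: Func stands in for Pig's EvalFunc, plain lists stand in for relations, and SkippedRecord, SkippedRecordHandler and ErrorSplit are hypothetical names, not Pig classes. The outputSchema method is left out to keep the sketch small.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a Pig UDF. This is NOT Pig's EvalFunc, only enough to run the sketch.
interface Func<I, O> {
    String name();
    O exec(I input) throws IOException;
}

// The proposed handler shape, with Func and I replacing EvalFunc and Tuple.
interface ErrorHandler<I, T> {
    T handle(IOException ioe, Func<I, ?> func, I input) throws IOException;
}

// Hypothetical record for the built-in handler: (input, funcname, errorMessage).
final class SkippedRecord<I> {
    final I input;
    final String funcName;
    final String errorMessage;

    SkippedRecord(I input, String funcName, String errorMessage) {
        this.input = input;
        this.funcName = funcName;
        this.errorMessage = errorMessage;
    }
}

// Hypothetical built-in handler: reports the failure instead of rethrowing it.
final class SkippedRecordHandler<I> implements ErrorHandler<I, SkippedRecord<I>> {
    @Override
    public SkippedRecord<I> handle(IOException ioe, Func<I, ?> func, I input) {
        return new SkippedRecord<>(input, func.name(), ioe.getMessage());
    }
}

final class ErrorSplit {
    // Mimics "A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS":
    // successful outputs go to a, handler outputs go to aErrors.
    static <I, O, E> void foreach(List<I> foo, Func<I, O> bar,
                                  ErrorHandler<I, E> handler,
                                  List<O> a, List<E> aErrors) throws IOException {
        for (I record : foo) {
            try {
                a.add(bar.exec(record));
            } catch (IOException ioe) {
                // If the handler itself throws, the exception propagates and the job fails.
                aErrors.add(handler.handle(ioe, bar, record));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Func<String, Integer> parseInt = new Func<String, Integer>() {
            @Override public String name() { return "ParseInt"; }
            @Override public Integer exec(String s) throws IOException {
                try {
                    return Integer.parseInt(s);
                } catch (NumberFormatException e) {
                    throw new IOException("not an integer: " + s, e);
                }
            }
        };
        List<Integer> a = new ArrayList<>();
        List<SkippedRecord<String>> aErrors = new ArrayList<>();
        foreach(Arrays.asList("1", "x", "3"), parseInt,
                new SkippedRecordHandler<String>(), a, aErrors);
        System.out.println(a + " skipped=" + aErrors.size());
    }
}
```

Because the handler returns a value rather than void, the same loop works for any handler output type, which is roughly what the generic parameter T in the interface above is for.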

Julien

On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

I was thinking about this..

We add an optional ON_ERROR clause to operators, which allows a user to
specify error handling. The error handler would be a udf that would
implement an interface along these lines:

public interface ErrorHandler {

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

}

I think it makes sense not to make this a static method, so that users can
keep required state and, for example, have the handler throw its own
IOException if it's been invoked too many times.
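A minimal Java sketch of such a stateful handler, under stated assumptions: VoidErrorHandler is a simplified stand-in for the interface above (a function name and a plain Object replace EvalFunc and Tuple so the code runs without Pig), and MaxSkipHandler is a hypothetical name.

```java
import java.io.IOException;

// Simplified stand-in for the proposed interface; not a Pig class.
interface VoidErrorHandler {
    void handle(IOException ioe, String funcName, Object input) throws IOException;
}

// Hypothetical handler: tolerates up to maxSkipped failures, then fails.
final class MaxSkipHandler implements VoidErrorHandler {
    private final int maxSkipped;
    private int skipped = 0; // instance state, which a static method could not keep

    MaxSkipHandler(int maxSkipped) {
        this.maxSkipped = maxSkipped;
    }

    @Override
    public void handle(IOException ioe, String funcName, Object input) throws IOException {
        skipped++;
        if (skipped > maxSkipped) {
            // Too many bad records: raise a new error, keeping the original as the cause.
            throw new IOException(
                funcName + " failed on more than " + maxSkipped + " records", ioe);
        }
        // Otherwise return normally so the caller can skip this record.
    }

    int skipped() {
        return skipped;
    }
}
```

With new MaxSkipHandler(2), the first two calls to handle return normally and the third throws. The count lives in the handler instance, so in a distributed job it would presumably be per task rather than global.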

D
On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <[EMAIL PROTECTED]> wrote:

> Thanks for the clarification Ashutosh.
>
> Implementing this in the user realm is tricky as Dmitriy states.
> Sensitivity to error thresholds will require support from the system. We can
> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
> users classify each record. The system can then track counts of each record
> type to facilitate the computation of thresholds. The last part is to allow
> users to specify thresholds and appropriate actions (interrupt, exit,
> continue, etc.). A possible mechanism to realize this is the
> ErrorHandlingUDF described by Dmitriy.
>
> Santhosh
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:[EMAIL PROTECTED]]
> Sent: Friday, January 14, 2011 7:35 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Exception Handling in Pig Scripts
>
> Santhosh,
>
> The way you are proposing, it will kill the Pig script. I think what the user
> wants is to ignore a few "bad records", process the rest, and get
> results. The problem here is how to let the user tell Pig the definition of a
> "bad record" and how to let him specify the threshold for % of bad records at
> which Pig should fail the script.
>
> Ashutosh
>
> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <[EMAIL PROTECTED]>
> wrote:
> > Sorry about the late response.
> >
> > Hadoop n00b is proposing a language extension for error handling, similar
> > to the mechanisms in other well known languages like C++, Java, etc.
> >
> > For now, can't the error semantics be handled by the UDF? For exceptional
> > scenarios you could throw an ExecException with the right details. The
> > physical operator that handles the execution of UDF's traps it for you and
> > propagates the error back to the client. You can take a look at any of the
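
The threshold idea raised in the quoted messages above (classify each record, count per class, act once bad records exceed a percentage) could be sketched in Java roughly as follows. All names here are hypothetical. The minSample parameter is an added assumption, not something from the thread, so that a single bad record at the very start does not count as 100% bad.

```java
import java.util.EnumMap;
import java.util.Map;

// Record taxonomy and actions, taken from the examples named in the thread.
enum RecordKind { GOOD, BAD, INCOMPLETE }
enum Action { CONTINUE, EXIT }

// Hypothetical tracker: counts records per kind and checks a bad-record threshold.
final class ThresholdTracker {
    private final Map<RecordKind, Long> counts = new EnumMap<>(RecordKind.class);
    private final double maxBadPercent;
    private final long minSample; // assumption: do not judge before this many records

    ThresholdTracker(double maxBadPercent, long minSample) {
        this.maxBadPercent = maxBadPercent;
        this.minSample = minSample;
    }

    // Records one classified record and returns the action to take next.
    Action record(RecordKind kind) {
        counts.merge(kind, 1L, Long::sum);
        if (total() >= minSample && badPercent() > maxBadPercent) {
            return Action.EXIT;
        }
        return Action.CONTINUE;
    }

    long count(RecordKind kind) {
        return counts.getOrDefault(kind, 0L);
    }

    long total() {
        long sum = 0;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    double badPercent() {
        long total = total();
        return total == 0 ? 0.0 : 100.0 * count(RecordKind.BAD) / total;
    }
}
```

In a real system the counts would have to be aggregated across tasks before the threshold is checked, which is presumably why the thread says this needs support from the system rather than from user code alone.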