Re: Exception Handling in Pig Scripts
My opinion is that the Pig feature would not use it.
What we're discussing is more granular and prevents the task from failing. It also allows the error to be handled in a simple way (in Pig).
Since Map-Reduce mainly executes Java code, you can do the same thing there by adding a try-catch statement and using a MultipleOutputFormat to send bad records to a differently named output.
In Pig, a UDF cannot have multiple outputs, so we need to add a mechanism to easily handle exceptions separately.
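
A minimal sketch of the Map-Reduce pattern described above, written in plain Java with no Hadoop dependency. The try-catch wraps the per-record work, and two in-memory lists stand in for the main output and the separately named bad-record output that a MultipleOutputFormat would provide. The class `BadRecordRouter`, its `process` method and the integer-parsing step are hypothetical names chosen for illustration; they are not part of Hadoop or Pig.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Hadoop or Pig API.
final class BadRecordRouter {
    final List<Integer> good = new ArrayList<>(); // stands in for the main output
    final List<String> bad = new ArrayList<>();   // stands in for the bad-record output

    // Processes every record; a failing record is diverted instead of failing the whole task.
    void process(List<String> records) {
        for (String record : records) {
            try {
                good.add(Integer.parseInt(record.trim()));
            } catch (NumberFormatException e) {
                bad.add(record); // keep the raw record so it can be inspected later
            }
        }
    }
}
```

The point of the sketch is only that the exception is caught per record, so one malformed input does not abort the rest of the batch.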

Julien

On 1/20/11 1:30 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:

If it's not already been discussed, how does this interact with
Hadoop's feature of skipping bad records:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/SkipBadRecords.html

Ashutosh
On Thu, Jan 20, 2011 at 12:53, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> Hi guys,
>
> Could you put a quick wiki with your proposal together? I think it would make it much easier than following the email discussion.
>
> Thanks,
>
> Olga
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, January 20, 2011 11:52 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Exception Handling in Pig Scripts
>
> Right, what I am saying is that the tasks would not fail because we'd catch
> the errors.
>
> Thanks for the lmyit link.. learn something new every day.
>
> On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <[EMAIL PROTECTED]> wrote:
>
>> Doesn't Hadoop discard the increments to counters done by failed tasks? (I
>> would expect that, but I don't know)
>> Also using counters we should make sure we don't mix up multiple relations
>> being combined by the optimizer.
>>
>> P.S.: Regarding rror, I don't see why you would want two of these:
>> http://lmyit.com/rror
>> :P
>>
>>
>> On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>>
>> I think this is coming together! I like the idea of a client-side handler
>> method that allows us to look at all errors in aggregate and make
>> decisions based on proportions. How can we guard against catching the wrong
>> mistakes -- say, letting a mapper that's running on a bad node and fails all
>> local disk writes finish "successfully" even though, properly, the task just
>> needs to be rerun on a different mapper and normally MR would just take care
>> of it?
>> Let's put this on a wiki for wider feedback.
>>
>> P.S. What's a "rror" and why do we only want one of them?
>>
>> On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Some more thoughts.
>> >
>> > * Looking at the existing keywords:
>> > http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
>> > It seems ONERROR would be better than ON_ERROR for consistency. There is an
>> > existing ONSCHEMA but no _ based keyword.
>> >
>> > * The default behavior should be to die on error and can be overridden as
>> > follows:
>> > DEFAULT ONERROR <error handler>;
>> >
>> > * Built in error handlers:
>> > Ignore() => ignores errors by dropping records that cause exceptions
>> > Fail() => fails the script on error. (default)
>> > FailOnThreshold(threshold) => fails if the number of errors is above the threshold
>> >
>> > * The error handler interface needs a method called on client side after
>> > the relation is computed to decide what to do next.
>> > Typically FailOnThreshold will throw an exception if
>> > (#errors/#input)>threshold using counters.
>> > public interface ErrorHandler<T> {
>> >
>> > // input is not the input of the UDF, it's the tuple from the relation
>> > T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
>> >
>> > Schema outputSchema(Schema input);
>> >
>> > // called afterwards on the client side
>> > void collectResult() throws IOException;
>> >
>> > }
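
One way the client-side check for FailOnThreshold could look, assuming the error and input counts have already been read from counters. This is a sketch in plain Java without Pig's classes: `ThresholdCheck`, `errorCount` and `inputCount` are hypothetical names, and only the (#errors/#input) > threshold rule comes from the proposal above.

```java
import java.io.IOException;

// Hypothetical stand-in for the client-side part of FailOnThreshold; not Pig API.
final class ThresholdCheck {
    private final double threshold; // e.g. 0.01 for 1%

    ThresholdCheck(double threshold) {
        this.threshold = threshold;
    }

    // True when the fraction of failed records exceeds the threshold.
    boolean exceeded(long errorCount, long inputCount) {
        if (inputCount == 0) {
            return false; // assumption: no input means there is nothing to fail on
        }
        return (double) errorCount / inputCount > threshold;
    }

    // Plays the role of collectResult(): throws once the relation is computed and the ratio is too high.
    void collectResult(long errorCount, long inputCount) throws IOException {
        if (exceeded(errorCount, inputCount)) {
            throw new IOException("Too many errors: " + errorCount + " of " + inputCount + " records");
        }
    }
}
```

Passing the counts in as arguments keeps the sketch independent of how the counters are actually collected, which the thread leaves open.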
>> >
>> > * SPLIT is optional
>> >
>> > example:
>> > DEFAULT ONERROR Ignore();
>> > ...
>> >
>> > DESCRIBE A;
>> > A: {name: chararray, age: int, gpa: float}
>> >
>> > -- fail if more than 1% errors