Pig, mail # dev - Re: Exception Handling in Pig Scripts


Re: Exception Handling in Pig Scripts
Ashutosh Chauhan 2011-01-20, 21:30
If it hasn't already been discussed, how does this interact with
Hadoop's feature of skipping bad records:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/SkipBadRecords.html

Ashutosh
On Thu, Jan 20, 2011 at 12:53, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> Hi guys,
>
> Could you put a quick wiki with your proposal together? I think it would make it much easier than following the email discussion.
>
> Thanks,
>
> Olga
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, January 20, 2011 11:52 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Exception Handling in Pig Scripts
>
> Right, what I am saying is that the tasks would not fail because we'd catch
> the errors.
>
> Thanks for the lmyit link.. learn something new every day.
>
> On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <[EMAIL PROTECTED]> wrote:
>
>> Doesn't Hadoop discard the increments to counters done by failed tasks? (I
>> would expect that, but I don't know)
>> Also using counters we should make sure we don't mix up multiple relations
>> being combined by the optimizer.
>>
>> P.S.: Regarding rror, I don't see why you would want two of these:
>> http://lmyit.com/rror
>> :P
>>
>>
>> On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>>
>> I think this is coming together! I like the idea of a client-side handler
>> method that allows us to look at all errors in aggregate and make
>> decisions based on proportions. How can we guard against catching the wrong
>> mistakes -- say, letting a mapper that's running on a bad node and fails all
>> local disk writes finish "successfully" even though, properly, the task just
>> needs to be rerun on a different mapper and normally MR would just take care
>> of it?
>> Let's put this on a wiki for wider feedback.
>>
>> P.S. What's a "rror" and why do we only want one of them?
>>
>> On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Some more thoughts.
>> >
>> > * Looking at the existing keywords:
>> > http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
>> > It seems ONERROR would be better than ON_ERROR for consistency. There is an
>> > existing ONSCHEMA but no _ based keyword.
>> >
>> > * The default behavior should be to die on error and can be overridden as
>> > follows:
>> > DEFAULT ONERROR <error handler>;
>> >
>> > * Built in error handlers:
>> > Ignore() => ignores errors by dropping records that cause exceptions
>> > Fail() => fails the script on error. (default)
>> > FailOnThreshold(threshold) => fails if the fraction of errors is above the threshold
>> >
>> > * The error handler interface needs a method called on client side after
>> > the relation is computed to decide what to do next.
>> > Typically FailOnThreshold will throw an exception if
>> > (#errors/#input)>threshold using counters.
>> > public interface ErrorHandler<T> {
>> >
>> > // input is not the input of the UDF, it's the tuple from the relation
>> > T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
>> >
>> > Schema outputSchema(Schema input);
>> >
>> > // called afterwards on the client side
>> > void collectResult() throws IOException;
>> >
>> > }
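
A minimal sketch of the client-side check that FailOnThreshold describes above, i.e. throwing when (#errors/#input) > threshold. It assumes the error and input counts have already been read from job counters and are simply passed in. The class name, the method names and the rule for zero input records are hypothetical, made up for illustration; none of this is Pig's API.

```java
import java.io.IOException;

// Hypothetical stand-in for a FailOnThreshold handler; not part of Pig.
class FailOnThresholdSketch {
    private final double threshold;

    FailOnThresholdSketch(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        this.threshold = threshold;
    }

    // True when the error ratio is strictly above the threshold.
    // Assumption: with no input records, any error at all counts as exceeding.
    boolean exceeds(long errorCount, long inputCount) {
        if (inputCount <= 0) {
            return errorCount > 0;
        }
        return (double) errorCount / inputCount > threshold;
    }

    // Plays the role of collectResult(): called on the client after the
    // relation is computed, with counts assumed to come from counters.
    void collectResult(long errorCount, long inputCount) throws IOException {
        if (exceeds(errorCount, inputCount)) {
            throw new IOException("error ratio above threshold " + threshold
                    + ": " + errorCount + " errors out of " + inputCount + " inputs");
        }
    }
}
```

With a threshold of 0.01, 1 error in 100 inputs passes (the comparison is strict) and 2 errors in 100 inputs throws. Whether the comparison should be strict, and what to do with an empty input, are open choices the thread does not settle.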
>> >
>> > * SPLIT is optional
>> >
>> > example:
>> > DEFAULT ONERROR Ignore();
>> > ...
>> >
>> > DESCRIBE A;
>> > A: {name: chararray, age: int, gpa: float}
>> >
>> > -- fail if more than 1% errors
>> > B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
>> > FailOnThreshold(0.01) ;
>> >
>> > -- need to make sure the twitter infrastructure can handle the load
>> > C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
>> >
>> > -- custom handler that counts errors and logs on the client side
>> > D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;
>> >
>> > -- uses default handler and SPLIT
>> > B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
>> > B2_ERRORS;
>> >
>> > -- B2_ERRORS can not really contain the input to the UDF as it would have