Pig >> mail # dev >> Re: Exception Handling in Pig Scripts


Julien Le Dem 2011-01-18, 18:27
Milind Bhandarkar 2011-01-18, 18:49
Koji Noguchi 2011-01-18, 18:48
Julien Le Dem 2011-01-18, 20:04
Dmitriy Ryaboy 2011-01-18, 23:24
Julien Le Dem 2011-01-19, 23:07
Dmitriy Ryaboy 2011-01-20, 12:54

Re: Exception Handling in Pig Scripts
Doesn't Hadoop discard counter increments made by failed tasks? (I would expect that, but I don't know.)
Also, if we use counters, we should make sure we don't mix up multiple relations that get combined by the optimizer.

P.S.: Regarding rror, I don't see why you would want two of these:
http://lmyit.com/rror
:P
On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

I think this is coming together! I like the idea of a client-side handler
method that allows us to look at all errors in aggregate and make
decisions based on proportions. How can we guard against catching the wrong
mistakes -- say, letting a mapper that's running on a bad node and failing all
its local disk writes finish "successfully", when really the task just needs
to be rerun on a different node and normally MR would take care of it?
Let's put this on a wiki for wider feedback.

P.S. What's a "rror" and why do we only want one of them?

On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <[EMAIL PROTECTED]> wrote:

> Some more thoughts.
>
> * Looking at the existing keywords:
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> It seems ONERROR would be better than ON_ERROR for consistency: there is an
> existing ONSCHEMA but no underscore-based keyword.
>
> * The default behavior should be to die on error and can be overridden as
> follows:
> DEFAULT ONERROR <error handler>;
>
> * Built in error handlers:
> Ignore() => ignores errors by dropping records that cause exceptions
> Fail() => fails the script on error. (default)
> FailOnThreshold(threshold) => fails if the number of errors exceeds the threshold
>
> * The error handler interface needs a method called on client side after
> the relation is computed to decide what to do next.
> Typically FailOnThreshold will throw an exception if
> (#errors/#input)>threshold using counters.
> public interface ErrorHandler<T> {
>
> // input is not the input of the UDF, it's the tuple from the relation
> T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
>  IOException;
>
> Schema outputSchema(Schema input);
>
> // called afterwards on the client side
> void collectResult() throws IOException;
>
> }
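[To make the proposed contract concrete, here is a minimal self-contained sketch of how FailOnThreshold might implement it. The Pig types are replaced with simple stand-ins, and the counter plumbing (recordInput, the local counts) is hypothetical; in the real proposal the counts would come from Hadoop counters aggregated across tasks and checked on the client in collectResult().]

```java
import java.io.IOException;

// Simplified stand-in for the proposed ErrorHandler interface
// (EvalFunc/Tuple replaced with plain types for illustration only).
interface SimpleErrorHandler<T> {
    T handle(IOException ioe, String udfName, Object input) throws IOException;
    void collectResult() throws IOException;
}

// Hypothetical sketch: counts accumulate locally here; in Pig they
// would be Hadoop counters aggregated across all tasks.
class FailOnThreshold implements SimpleErrorHandler<Void> {
    private final double threshold;
    private long errors = 0;
    private long inputs = 0;

    FailOnThreshold(double threshold) { this.threshold = threshold; }

    // Stand-in for the #input counter.
    public void recordInput() { inputs++; }

    // Drop the failing record, but count the error.
    public Void handle(IOException ioe, String udfName, Object input) {
        errors++;
        return null;
    }

    // Client-side check after the relation is computed:
    // fail the script if (#errors / #input) > threshold.
    public void collectResult() throws IOException {
        if (inputs > 0 && (double) errors / inputs > threshold) {
            throw new IOException("error ratio " + ((double) errors / inputs)
                    + " exceeds threshold " + threshold);
        }
    }
}
```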
>
> * SPLIT is optional
>
> example:
> DEFAULT ONERROR Ignore();
> ...
>
> DESCRIBE A;
> A: {name: chararray, age: int, gpa: float}
>
> -- fail if more than 1% errors
> B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
> FailOnThreshold(0.01) ;
>
> -- need to make sure the twitter infrastructure can handle the load
> C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
>
> -- custom handler that counts errors and logs on the client side
> D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors() ;
>
> -- uses default handler and SPLIT
> B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> B2_ERRORS;
>
> -- B2_ERRORS cannot really contain the input to the UDF, as it would have a
> different schema depending on which UDF failed
> DESCRIBE B2_ERRORS;
> B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray,
> error:(class: chararray, message: chararray, stacktrace: chararray) }
>
> -- example of filtering on the udf
> C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
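[The split-by-udf routing above can be pictured with a small model of the proposed error record. The field names mirror the B2_ERRORS schema from the proposal, but the classes themselves are purely illustrative, not part of Pig.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the proposed error record:
// {input: (...), udf: chararray, error: (class, message, stacktrace)}
class ErrorRecord {
    final Object input;       // the tuple from the relation, not the UDF input
    final String udf;         // name of the UDF that threw
    final String errorClass;
    final String errorMessage;

    ErrorRecord(Object input, String udf, String errorClass, String errorMessage) {
        this.input = input;
        this.udf = udf;
        this.errorClass = errorClass;
        this.errorMessage = errorMessage;
    }
}

class ErrorSplitter {
    // Rough equivalent of: ... ONERROR SPLIT INTO C2_FOO_ERRORS IF udf='Foo', ...
    static List<ErrorRecord> splitByUdf(List<ErrorRecord> errors, String udfName) {
        List<ErrorRecord> out = new ArrayList<>();
        for (ErrorRecord e : errors) {
            if (udfName.equals(e.udf)) out.add(e);
        }
        return out;
    }
}
```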
>
> Julien
>
> On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>
> We should think more about the interface.
> For example, "Tuple input" argument -- is that the tuple that was passed to
> the udf, or the whole tuple that was being processed? I can see wanting
> both.
> Also the Handler should probably have init and finish methods in case some
> accumulation is happening, or state needs to get set up...
>
> not sure about "splitting" into a table. Maybe more like
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
> A_ERRORS;
>
> "use" and "into" are optional syntactic sugar.
>
> This allows us to do any combination of:
> - die
> - put original record into a table
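[The init/finish suggestion above corresponds to a common accumulator lifecycle. A hypothetical sketch of what that could look like (the method names are made up for illustration, not taken from the proposal):]

```java
// Hypothetical stateful-handler lifecycle, as suggested above:
// init() before processing, handle() once per failing record,
// finish() to flush whatever state was accumulated.
class CountingHandler {
    private long errorCount;
    private boolean initialized;

    void init() {              // set up state (e.g., open a log, reset counters)
        errorCount = 0;
        initialized = true;
    }

    void handle(Exception e) { // called once per error
        if (!initialized) throw new IllegalStateException("init() not called");
        errorCount++;
    }

    long finish() {            // flush accumulated state; here, report the count
        initialized = false;
        return errorCount;
    }
}
```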
Dmitriy Ryaboy 2011-01-20, 19:51
Olga Natkovich 2011-01-20, 20:53
Julien Le Dem 2011-01-20, 21:49
Olga Natkovich 2011-01-20, 23:19
Julien Le Dem 2011-01-21, 02:35
Olga Natkovich 2011-01-21, 21:16
Ashutosh Chauhan 2011-01-20, 21:30
Julien Le Dem 2011-01-21, 03:01