Re: Exception Handling in Pig Scripts
Ashutosh Chauhan 2011-01-15, 03:35
What you are proposing will kill the Pig script. I think what the
user wants is to ignore a few "bad records", process the rest, and
still get results. The problem is how to let the user tell Pig what
counts as a "bad record", and how to let them specify the threshold
(% of bad records) at which Pig should fail the script.
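
To make the threshold idea concrete, here is a minimal sketch in plain
Java. It is not Pig API: TolerantProcessor, DateDiff, maxBadFraction and
minRecordsBeforeCheck are all hypothetical names made up for
illustration. It wraps a per-record function, traps its exceptions,
returns null for bad records, and fails only once the fraction of bad
records goes over the configured limit. Note that inside a real job each
task would only see its own share of the input, so a count like this
would be per task unless it is aggregated some other way.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper (not part of Pig): traps per-record failures and enforces a bad-record threshold. */
final class TolerantProcessor<I, O> {
    private final Function<I, O> fn;
    private final double maxBadFraction;      // e.g. 0.01 means "fail above 1% bad records"
    private final long minRecordsBeforeCheck; // do not judge the ratio on a tiny sample
    private final List<I> rejected = new ArrayList<>();
    private long total = 0;
    private long bad = 0;

    TolerantProcessor(Function<I, O> fn, double maxBadFraction, long minRecordsBeforeCheck) {
        this.fn = fn;
        this.maxBadFraction = maxBadFraction;
        this.minRecordsBeforeCheck = minRecordsBeforeCheck;
    }

    /** Returns fn's result, or null for a bad record; throws once the bad fraction exceeds the limit. */
    O apply(I record) {
        total++;
        try {
            return fn.apply(record);
        } catch (RuntimeException e) {
            bad++;
            rejected.add(record); // kept so the bad records can be inspected later
            if (total >= minRecordsBeforeCheck && (double) bad / total > maxBadFraction) {
                throw new IllegalStateException("Too many bad records: " + bad + " of " + total, e);
            }
            return null;
        }
    }

    long badCount() { return bad; }
    long totalCount() { return total; }
    List<I> rejected() { return rejected; }
}

/** Hypothetical stand-in for a date-difference function that fails on overflow or bad input. */
final class DateDiff {
    private DateDiff() {}

    /** Seconds between two ISO dates (yyyy-MM-dd); throws if the input is malformed or the result does not fit in an int. */
    static int secondsBetween(String from, String to) {
        LocalDateTime a = LocalDate.parse(from).atStartOfDay();
        LocalDateTime b = LocalDate.parse(to).atStartOfDay();
        return Math.toIntExact(ChronoUnit.SECONDS.between(a, b));
    }
}
```

With a processor built around DateDiff.secondsBetween, a pair like
("0001-01-01", "2011-01-01") overflows int and comes back as null
instead of stopping the run, and the processor throws only when the bad
fraction passes the limit. How the user would declare this from a Pig
script is still the open design question.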
On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <[EMAIL PROTECTED]> wrote:
> Sorry about the late response.
> Hadoop n00b is proposing a language extension for error handling, similar to the mechanisms in other well known languages like C++, Java, etc.
> For now, can't the error semantics be handled by the UDF? For exceptional scenarios you could throw an ExecException with the right details. The physical operator that handles the execution of UDFs traps it for you and propagates the error back to the client. You can take a look at any of the builtin UDFs to see how Pig handles it internally.
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, January 11, 2011 10:41 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Exception Handling in Pig Scripts
> Right now error handling is controlled by the UDFs themselves, and there is no way to direct it externally.
> You can make an ErrorHandlingUDF that would take a udf spec, invoke it, trap errors, and then do the specified error handling behavior. That's a bit ugly, though.
> There is a problem with trapping general exceptions of course, in that if they happen 0.000001% of the time you can probably just ignore them, but if they happen in half your dataset, you want the job to tell you something is wrong. So this stuff gets non-trivial. If anyone wants to propose a design to solve this general problem, I think that would be a welcome addition.
> On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <[EMAIL PROTECTED]> wrote:
>> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
>> date format, but when I try to get the seconds between this and
>> another date, say 2011-01-01, I get an error that the value is too
>> large to be fit into int and the process stops. Do we have something
>> like ifError(x-y, null,x-y)? Or would I have to implement this as an
>> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]>
>> > Create a UDF that verifies the format, and go through a filtering
>> > step first.
>> > If you would like to save the malformed records so you can look at
>> > them later, you can use the SPLIT operator to route the good records
>> > to your regular workflow, and the bad records to some place on HDFS.
>> > -D
>> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <[EMAIL PROTECTED]> wrote:
>> > > Hello,
>> > >
>> > > I have a pig script that uses piggy bank to calculate date differences.
>> > > Sometimes, when I get a weird date or wrong format in the input,
>> > > the script throws an error and aborts.
>> > >
>> > > Is there a way I could trap these errors and move on without
>> > > stopping execution?
>> > >
>> > > Thanks
>> > >
>> > > PS: I'm using CDH2 with Pig 0.5
>> > >