Re: Skipping Bad Records
Tom White 2011-10-13, 21:31
The skipping feature should really only be used when you are calling
out to a third-party library that may segfault on corrupt data, and
even then it's probably better to use a subprocess to handle it, as
Owen suggested here:
In other cases you should handle the corrupt data in your mapper or
reducer, by catching the relevant exception, for example.
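
A minimal sketch of that approach, written as plain Java so it runs
without a cluster. This is not code from this thread: the class name,
the Result type, and parseRecord are hypothetical, and it assumes each
record is a line that should parse as a number. In a real job the same
try/catch would sit inside map() or reduce(), and the skipped count
would more likely be reported through a job counter than returned.

```java
import java.util.Arrays;
import java.util.List;

public class TolerantRecordProcessor {

    /** Outcome of one pass over the input (hypothetical helper type). */
    public static final class Result {
        public final long sum;
        public final int skipped;

        Result(long sum, int skipped) {
            this.sum = sum;
            this.skipped = skipped;
        }
    }

    // Hypothetical parser standing in for real record decoding.
    // Throws NumberFormatException when the line is not a valid long.
    static long parseRecord(String line) {
        return Long.parseLong(line.trim());
    }

    // Processes every line, catching the parse failure per record so that
    // one corrupt record does not fail the whole batch (or task).
    public static Result process(List<String> lines) {
        long sum = 0;
        int skipped = 0;
        for (String line : lines) {
            try {
                sum += parseRecord(line);
            } catch (NumberFormatException e) {
                // Corrupt record: count it and move on to the next one.
                skipped++;
            }
        }
        return new Result(sum, skipped);
    }

    public static void main(String[] args) {
        Result r = process(Arrays.asList("10", "oops", " 32 ", ""));
        System.out.println("sum=" + r.sum + " skipped=" + r.skipped);
    }
}
```

Catching only the specific exception the parser is known to throw keeps
genuine bugs visible instead of silently counting them as bad records.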
On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <[EMAIL PROTECTED]> wrote:
> Thanks for the info. If I get some time maybe I can assist. I'm
> looking over your code now. For now I am failing the files with the
> mapred.max.map.failures.percent property, but I'm losing a lot of good
> data going that route.
> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Unfortunately not. The new API does not have a skipping feature yet
>> like the older one.
>> I did get started on some work on
>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>> haven't been able to find time to complete it with proper tests and
>> such. I'll try to do it within a week from now.
>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <[EMAIL PROTECTED]> wrote:
>>> Can anyone confirm whether the skip options work for MR jobs using the
>>> new API? I have a job using the new API and I cannot get the job to
>>> skip corrupted records. I tried configuring job properties manually
>>> and using the SkipBadRecords class.
>> Harsh J