MapReduce >> mail # user >> Skipping Bad Records


Justin Woody 2011-10-12, 16:36
Harsh J 2011-10-12, 20:27
Justin Woody 2011-10-13, 12:41
Re: Skipping Bad Records
Justin,

The skipping feature should really only be used when you are calling
out to a third-party library that may segfault on corrupt data, and
even then it's probably better to use a subprocess to handle it, as
Owen suggested here:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+[EMAIL PROTECTED]%3e.
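
As a rough illustration of the subprocess idea, here is a minimal sketch
in plain Java (no Hadoop classes). The class name IsolatedCall, the
method succeeds, and the sh commands in main are all hypothetical
stand-ins; a real job would launch whatever tool wraps the third-party
library, and the sketch assumes a POSIX sh is on the PATH. The point is
only that a crash in the child shows up as a non-zero exit code instead
of taking down the task JVM.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs an external command in a child process so that
// a crash there cannot kill the calling JVM.
class IsolatedCall {

    // Returns true only if the command starts, finishes within the timeout,
    // and exits with status 0. Any other outcome is reported as a failure.
    static boolean succeeds(List<String> command, long timeoutSeconds) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)                        // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.INHERIT)  // let the child print to our stdout
                    .start();
            p.getOutputStream().close();                              // the child gets no stdin
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();                                  // hung child: kill it
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false;                                             // could not start the command
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                       // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        // "exit 139" stands in for a child that died from a segfault (128 + SIGSEGV).
        System.out.println(succeeds(Arrays.asList("sh", "-c", "exit 0"), 10));
        System.out.println(succeeds(Arrays.asList("sh", "-c", "exit 139"), 10));
    }
}
```

With something like this, the caller can count or log the record whose
child process failed and carry on with the next one.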

In other cases you should handle the corrupt data in your mapper or
reducer, by catching the relevant exception, for example.
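
A minimal sketch of that pattern, written as plain Java so it runs
without Hadoop on the classpath. The class TolerantParser, the Result
holder, and the tab-separated "key<TAB>integer" record format are
assumptions made for illustration, not anything from the job discussed
here. In a real mapper the try/catch would sit inside map(), good
values would go to context.write(), and the bad-record tally would
typically be a job counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone sketch of per-record error handling.
class TolerantParser {

    // Hypothetical holder for parsed values and the number of rejected records.
    static final class Result {
        final List<Integer> values = new ArrayList<>();
        int badRecords = 0;
    }

    // Assumed record format: "key<TAB>integer". Throws on anything else.
    static int parseValue(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields, got " + fields.length);
        }
        return Integer.parseInt(fields[1].trim());
    }

    // Processes every record; a corrupt one is counted and skipped
    // instead of failing the whole run.
    static Result processAll(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            try {
                result.values.add(parseValue(line));
            } catch (IllegalArgumentException e) {
                // NumberFormatException is a subclass, so it is caught here too.
                result.badRecords++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = processAll(Arrays.asList("a\t1", "garbage", "b\tx7", "c\t3"));
        System.out.println(r.values + " bad=" + r.badRecords); // prints "[1, 3] bad=2"
    }
}
```

Catching only the exception type the parser is known to throw keeps
genuine bugs visible, which catching Exception wholesale would hide.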

Tom

On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <[EMAIL PROTECTED]> wrote:
> Harsh,
>
> Thanks for the info. If I get some time maybe I can assist. I'm
> looking over your code now. For now I am failing the files with the
> mapred.max.map.failures.percent property, but I'm losing a lot of good
> data going that route.
>
> Justin
>
> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Justin,
>>
>> Unfortunately not. The new API does not have a skipping feature yet
>> like the older one.
>>
>> I did get started on some work on
>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>> haven't been able to find time to complete it with proper tests and
>> such. I'll try to do it within a week from now.
>>
>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <[EMAIL PROTECTED]> wrote:
>>> Can anyone confirm whether the skip options work for MR jobs using the
>>> new API? I have a job using the new API and I cannot get the job to
>>> skip corrupted records. I tried configuring job properties manually
>>> and using the SkipBadRecords class.
>>>
>>> Thanks,
>>> Justin
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>
Justin Woody 2011-10-14, 11:59