Re: Skipping Bad Records
Tom White 2011-10-13, 21:31
Justin,
The skipping feature should really only be used when you are calling out to a third-party library that may segfault on corrupt data, and even then it's probably better to use a subprocess to handle it, as Owen suggested here: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+[EMAIL PROTECTED]%3e.

In other cases you should handle the corrupt data in your mapper or reducer, by catching the relevant exception, for example.

Tom

On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <[EMAIL PROTECTED]> wrote:
> Harsh,
>
> Thanks for the info. If I get some time maybe I can assist. I'm
> looking over your code now. For now I am failing the files with the
> mapred.max.map.failures.percent property, but I'm losing a lot of good
> data going that route.
>
> Justin
>
> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Justin,
>>
>> Unfortunately not. The new API does not have a skipping feature yet
>> like the older one.
>>
>> I did get started on some work on
>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>> haven't been able to find time to complete it with proper tests and
>> such. I'll try to do it within a week from now.
>>
>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <[EMAIL PROTECTED]> wrote:
>>> Can anyone confirm whether the skip options work for MR jobs using the
>>> new API? I have a job using the new API and I cannot get the job to
>>> skip corrupted records. I tried configuring job properties manually
>>> and using the SkipBadRecords class.
>>>
>>> Thanks,
>>> Justin
>>
>> --
>> Harsh J
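
For illustration, the "catch the relevant exception in your mapper" advice above could look roughly like the sketch below. It is plain Java with no Hadoop classes so that it runs standalone; the class name, the tab-separated record format, and the counter field are all hypothetical and not taken from Justin's job. In a real mapper the try/catch would sit inside map(), and the bad-record count would more likely be kept in a job counter than in a field.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the body of a map() method (hypothetical names throughout).
class TolerantRecordParser {
    // In a real job this would probably be a Hadoop counter rather than a field.
    private long badRecords = 0;

    // Assumed record format for this sketch: "key<TAB>integer".
    static int parseValue(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields, got " + fields.length);
        }
        return Integer.parseInt(fields[1].trim());
    }

    // Parses every line, keeping good values and counting the ones that fail.
    List<Integer> process(List<String> lines) {
        List<Integer> out = new ArrayList<>();
        for (String line : lines) {
            try {
                out.add(parseValue(line));
            } catch (IllegalArgumentException e) {
                // NumberFormatException is a subclass of IllegalArgumentException,
                // so both malformed lines and non-numeric values land here.
                badRecords++;
            }
        }
        return out;
    }

    long getBadRecords() {
        return badRecords;
    }
}
```

Catching only the exception type the parser is expected to throw, instead of a blanket catch, keeps genuine bugs visible while a corrupt record costs one counter increment instead of a failed task.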