MapReduce >> mail # user >> Skipping Bad Records


Justin Woody 2011-10-12, 16:36
Harsh J 2011-10-12, 20:27
Justin Woody 2011-10-13, 12:41
Re: Skipping Bad Records
Justin,

The skipping feature should really only be used when you are calling
out to a third-party library that may segfault on corrupt data, and
even then it's probably better to use a subprocess to handle it, as
Owen suggested here:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3cCAFQoU9Ekv+SBvAv-bSF5dORJO68VSj6zTqXywWUT+[EMAIL PROTECTED]%3e.

In other cases you should handle the corrupt data in your mapper or
reducer, by catching the relevant exception, for example.
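
As a rough illustration of that pattern, here is a minimal sketch in plain
Java with no Hadoop dependency. The record format ("key<TAB>integer"), the
class name, and the helper names are all hypothetical and only stand in for
whatever parsing your map() method does. In a real mapper the skipped count
would more likely go to a job counter, e.g.
context.getCounter(...).increment(1), rather than a local field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipCorruptRecords {

    /** Parsed values plus a count of the records that were skipped. */
    static final class Result {
        final List<Integer> values = new ArrayList<>();
        long skipped = 0;
    }

    // Hypothetical record format: "key<TAB>integer".
    // Throws IllegalArgumentException (or its subclass
    // NumberFormatException) when the line is malformed.
    static int parseValue(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("expected 2 fields: " + line);
        }
        return Integer.parseInt(fields[1].trim());
    }

    // Plays the role of the body of map(): each record is parsed inside
    // its own try/catch, so one corrupt record is counted and dropped
    // instead of failing the whole task.
    static Result process(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            try {
                result.values.add(parseValue(line));
            } catch (IllegalArgumentException e) {
                // In a Hadoop mapper, increment a counter here.
                result.skipped++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = process(Arrays.asList("a\t1", "garbage", "b\tx", "c\t3"));
        System.out.println(r.values + " skipped=" + r.skipped);
        // prints: [1, 3] skipped=2
    }
}
```

Catching only the exception type the parser is known to throw keeps
unrelated failures (I/O errors, bugs) visible instead of silently
swallowing them.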

Tom

On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <[EMAIL PROTECTED]> wrote:
> Harsh,
>
> Thanks for the info. If I get some time maybe I can assist. I'm
> looking over your code now. For now I am failing the files with the
> mapred.max.map.failures.percent property, but I'm losing a lot of good
> data going that route.
>
> Justin
>
> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Justin,
>>
>> Unfortunately not. The new API does not have a skipping feature yet
>> like the older one.
>>
>> I did get started on some work on
>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this but I
>> haven't been able to find time to complete it with proper tests and
>> such. I'll try to do it within a week from now.
>>
>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <[EMAIL PROTECTED]> wrote:
>>> Can anyone confirm whether the skip options work for MR jobs using the
>>> new API? I have a job using the new API and I cannot get the job to
>>> skip corrupted records. I tried configuring job properties manually
>>> and using the SkipBadRecords class.
>>>
>>> Thanks,
>>> Justin
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>
Justin Woody 2011-10-14, 11:59