
Skipping bad records
Hey hive gurus -

I recently had some issues getting Hive to process a partition with bad
records, and am curious how others deal with this issue. From searching
around, I learned Hive uses the MR-provided bad record skipping
functionality, instead of doing anything specific about bad records.

The partition I processed was roughly 87GB, with around 600 million records.

The job eventually completed (with 350 task failures) with these settings:

set mapred.skip.mode.enabled=true;
set mapred.map.max.attempts=100;
set mapred.reduce.max.attempts=100;
set mapred.skip.map.max.skip.records=30000;
set mapred.skip.attempts.to.start.skipping=1;

I believe this means 350 records (~0.00006%) caused the job to initially fail.
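
A quick check of that figure, assuming exactly 350 bad records out of 600 million (both numbers are approximate in the post, and one failed task is assumed to correspond to one bad record):

```python
# Approximate figures quoted in the post; treated as exact for the arithmetic.
bad_records = 350
total_records = 600_000_000

fraction = bad_records / total_records   # share of records that were bad
percent = fraction * 100                 # same share expressed as a percentage

print(f"fraction: {fraction:.2e}")       # fraction: 5.83e-07
print(f"percent:  {percent:.5f}%")       # percent:  0.00006%
```

So the bad records are roughly 6 in every 10 million, or about 0.00006% of the partition.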

The code throwing the exception has a todo to discuss record
Has a discussion around natively handling bad records happened? As a
comparison, Elephant-Bird handles some percent of bad records without
causing task failures.
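
As a rough illustration of the threshold idea, here is a minimal Python sketch that skips unparseable records until the bad fraction exceeds a configured limit. It is not Elephant-Bird's or Hive's actual code; `BadRecordTracker`, `parse_all`, `TooManyBadRecords`, and the threshold values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class TooManyBadRecords(Exception):
    """Raised when the share of bad records exceeds the allowed threshold."""


@dataclass
class BadRecordTracker:
    # Hypothetical defaults: tolerate up to 0.01% bad records, and only
    # enforce the limit once enough records have been seen.
    max_bad_fraction: float = 0.0001
    min_records: int = 1000
    total: int = 0
    bad: int = 0

    def record_ok(self) -> None:
        self.total += 1

    def record_bad(self, error: Exception) -> None:
        self.total += 1
        self.bad += 1
        # Fail the task only when the bad share is above the threshold.
        if (self.total >= self.min_records
                and self.bad / self.total > self.max_bad_fraction):
            raise TooManyBadRecords(
                f"{self.bad} of {self.total} records were bad; last error: {error}"
            )


def parse_all(lines: Iterable[str],
              parse: Callable[[str], T],
              tracker: BadRecordTracker) -> Iterator[T]:
    """Yield parsed records, skipping bad ones while under the threshold."""
    for line in lines:
        try:
            value = parse(line)
        except ValueError as err:
            tracker.record_bad(err)  # may raise TooManyBadRecords
            continue
        tracker.record_ok()
        yield value


if __name__ == "__main__":
    tracker = BadRecordTracker(max_bad_fraction=0.01, min_records=10)
    lines = ["1", "2", "oops", "4"]
    print(list(parse_all(lines, int, tracker)), tracker.bad)
```

The point of the design is that a single malformed record is counted and skipped in place rather than failing the whole task attempt, so the job only fails when the input is broadly corrupt.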