Hey Hive gurus -
I recently had some issues getting Hive to process a partition with bad
records, and am curious how others deal with this. From searching
around, I learned that Hive relies on the MR-provided bad-record skipping
functionality rather than doing anything Hive-specific about bad records.
The partition I processed was roughly 87GB, with around 600 million records.
The job eventually completed (with 350 task failures) with these settings:
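(The values below are illustrative rather than the exact ones from my job; the property names are the standard old-style Hadoop SkipBadRecords ones.)

```sql
-- Illustrative only: standard mapred.* skip-mode properties, not my exact values.
-- A value > 0 here enables record skipping on task retries.
SET mapred.skip.map.max.skip.records=1;
-- Start skipping only after this many failed attempts of a task.
SET mapred.skip.attempts.to.start.skipping=2;
-- Give each task enough attempts to narrow down the bad record(s).
SET mapred.map.max.attempts=10;
```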
I believe this means 350 records (~0.00006% of the partition) caused the job to
initially fail.
The code throwing the exception has a TODO to discuss how bad records should be handled.
Has a discussion around natively handling bad records happened? As a
comparison, Elephant-Bird handles some percentage of bad records without
causing task failures.
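A minimal sketch of that style of tolerance (my own illustration, not Elephant-Bird's actual API): count bad records per task and fail only once they exceed some fraction of the input, instead of failing on the first parse error.

```java
// Sketch of threshold-based bad-record tolerance (hypothetical class,
// not Elephant-Bird's real API): a record reader or SerDe would call
// record(...) for each input record and keep going past isolated bad ones.
public class BadRecordTracker {
    private final double maxBadFraction;      // e.g. 0.01 = tolerate up to 1% bad
    private final long minRecordsBeforeCheck; // don't enforce on a tiny sample
    private long total = 0;
    private long bad = 0;

    public BadRecordTracker(double maxBadFraction, long minRecordsBeforeCheck) {
        this.maxBadFraction = maxBadFraction;
        this.minRecordsBeforeCheck = minRecordsBeforeCheck;
    }

    /** Call once per input record; parsed=false means deserialization failed. */
    public void record(boolean parsed) {
        total++;
        if (!parsed) {
            bad++;
            // Fail the task only once bad records exceed the threshold,
            // and only after a meaningful number of records has been seen.
            if (total >= minRecordsBeforeCheck
                    && (double) bad / total > maxBadFraction) {
                throw new RuntimeException(
                        "Too many bad records: " + bad + " of " + total);
            }
        }
    }

    public long badCount() { return bad; }

    public static void main(String[] args) {
        // 2 bad records out of 1000 stays well under a 1% threshold.
        BadRecordTracker t = new BadRecordTracker(0.01, 100);
        for (int i = 0; i < 1000; i++) {
            t.record(i % 500 != 0); // records 0 and 500 fail to parse
        }
        System.out.println("bad=" + t.badCount()); // prints bad=2
    }
}
```

The point of the sample-size floor is that a task shouldn't die because its very first record happens to be bad; the percentage is only meaningful once enough records have been read.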