Hive >> mail # user >> Skipping bad records


Skipping bad records
Hey hive gurus -

I recently had some issues getting Hive to process a partition containing bad
records, and I'm curious how others deal with this. From searching around, I
learned that Hive relies on the bad-record skipping functionality provided by
MapReduce, rather than doing anything Hive-specific about bad records.

The partition I processed was roughly 87GB, with around 600 million records.

The job eventually completed (with 350 task failures) with these settings:

set mapred.skip.mode.enabled=true;             -- turn on record-skipping mode
set mapred.map.max.attempts=100;               -- allow up to 100 attempts per map task
set mapred.reduce.max.attempts=100;            -- allow up to 100 attempts per reduce task
set mapred.skip.map.max.skip.records=30000;    -- acceptable number of records to skip around a bad record
set mapred.skip.attempts.to.start.skipping=1;  -- start skipping after the first failed attempt
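If I understand the skip mechanism correctly, on each retry the framework
narrows the suspect range around the failing record, roughly a binary search,
until the range is at or below mapred.skip.map.max.skip.records and can be
skipped wholesale. A toy sketch of that narrowing (names are mine, not
Hadoop's API), which also shows why the attempt count grows only
logarithmically with the failing range:

```java
// Toy model of skip-mode narrowing: each extra task attempt roughly
// halves the range of records suspected to contain the bad record,
// until the range is small enough (<= maxSkip) to skip outright.
public class SkipNarrowing {

    // Returns the number of extra attempts needed to narrow a failing
    // range of `rangeSize` records down to `maxSkip` or fewer.
    public static int attemptsToNarrow(long rangeSize, long maxSkip) {
        int attempts = 0;
        while (rangeSize > maxSkip) {
            rangeSize = (rangeSize + 1) / 2;  // halve the suspect range
            attempts++;
        }
        return attempts;
    }
}
```

So with max.skip.records=30000, even a million-record suspect range would
narrow in a handful of retries, well under the 100-attempt ceiling above.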

I believe this means something like 350 bad records (~0.00006% of the 600
million) caused the job to initially fail?

The code throwing the exception has a TODO to discuss record deserialization
errors:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L508
Has a discussion around natively handling bad records happened? As a
comparison, Elephant-Bird tolerates some percentage of bad records without
causing task failures:
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java
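For what it's worth, the Elephant-Bird-style policy could be sketched roughly
like this (names and thresholds are illustrative, not Elephant-Bird's actual
API): count deserialization failures as records flow through, and fail the
task only once errors exceed a configured fraction of records read.

```java
// Illustrative sketch of an error-rate threshold for a record reader.
// Instead of failing on the first bad record, tolerate errors as long
// as they stay under a fixed fraction of the records seen so far.
public class ErrorThresholdTracker {
    private final double maxErrorRate;        // e.g. 0.0001 = 0.01%
    private final long minRecordsBeforeCheck; // avoid noisy early ratios
    private long records = 0;
    private long errors = 0;

    public ErrorThresholdTracker(double maxErrorRate, long minRecordsBeforeCheck) {
        this.maxErrorRate = maxErrorRate;
        this.minRecordsBeforeCheck = minRecordsBeforeCheck;
    }

    // Call once per record; returns false when the error rate is exceeded
    // and the task should fail rather than silently skip more records.
    public boolean recordResult(boolean deserializedOk) {
        records++;
        if (!deserializedOk) errors++;
        if (records < minRecordsBeforeCheck) return true;
        return ((double) errors / records) <= maxErrorRate;
    }

    public long skipped() { return errors; }
}
```

With a budget like that, my 350 bad records out of 600 million would sail
through without a single task failure.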

Thanks!
Travis