HBase >> mail # dev >> MR job "randomly" scans thousands of rows fewer than it should

MR job "randomly" scans thousands of rows fewer than it should
We have an MR job that runs every few minutes on time series data that is continuously updated (and never deleted).
Once every few runs (somewhere between tens and hundreds of runs), the map task that covers the last region gets fewer input records than expected (off by 500-5000 rows), without any splits happening. The lower number of input records can persist for a few MR runs, but it eventually returns to the "correct" value.
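Since rows are only ever added, the input record count for that map task should never go down from one run to the next, so the anomaly can be spotted mechanically. A minimal sketch of such a check, assuming the per-run "map input records" values have already been collected into an array; the class name, method name, and the numbers in `main` are made up for illustration and are not taken from our job:

```java
import java.util.ArrayList;
import java.util.List;

public class InputRecordDropCheck {

    // Returns the indices of runs whose count is lower than the highest
    // count seen in any earlier run. With append-only data, every such
    // run is suspicious.
    public static List<Integer> findDrops(long[] counts) {
        List<Integer> drops = new ArrayList<>();
        long highWater = Long.MIN_VALUE;
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] < highWater) {
                drops.add(i);           // fewer rows than already seen before
            } else {
                highWater = counts[i];  // new maximum, count is plausible
            }
        }
        return drops;
    }

    public static void main(String[] args) {
        // Hypothetical counts: a drop at run 2 that persists through run 3
        // and recovers at run 4.
        long[] counts = {100000, 100400, 98900, 99100, 101200};
        System.out.println(findDrops(counts));
    }
}
```

Comparing against the running maximum rather than only the previous run keeps a drop flagged for as long as it persists, which matches the behaviour described above.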

The drop shows up both in the "map input records" counter and in the metrics computed by the MR job itself, so it is not an MR counter bug.

There are no exceptions in the MR job or on the region server, and the drop doesn't seem to be correlated with any compaction, split, or region movement.
The only "variable" in this scenario is that new data gets injected continuously (that, and the MR job itself, which is idempotent).

This entire puzzle takes place on HBase 0.90.5-ish (12 Dec 2011) on top of Hadoop cdh3u2.