I have a job that does heavy writing into HBase. The most recent run was
94 million records, each being put to two tables: one table stores a
KeyValue per record, while the other batches them into bundles of up to
a few thousand records. This latest run took about 25 minutes to complete.
We are currently in a phase of development where we need to do these
migrations often, and we noticed that enabling the WAL slows the job down
by about 6-8x. In the interest of speed, we have disabled the WAL and added
the following safeguards:
1) At the beginning of the job we check for any dead servers. At the end
of the job we check again, and compare. If there is a new dead server, we
retry the job (the jobs are idempotent/reentrant).
2) At the end of the job, if no servers were lost, we force a memstore
flush on the tables that were saved to, using HBaseAdmin.flush(String
tableName). We then poll the HServerLoad.RegionLoad for all regions of the
tables we flushed, checking the memStoreSizeMB and waiting until it reaches
0 (obviously with a time limit, which causes the job to fail).
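The two safeguards above can be sketched roughly as follows. The cluster
calls are hidden behind a hypothetical ClusterView interface; in practice
these would delegate to HBaseAdmin.getClusterStatus().getDeadServerNames(),
HBaseAdmin.flush(tableName), and summing memStoreSizeMB over the table's
HServerLoad.RegionLoad entries. The names and structure here are illustrative,
not our actual code:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical abstraction over the HBase admin calls described above. */
interface ClusterView {
    Collection<String> deadServerNames();
    void flush(String tableName);               // real HBaseAdmin.flush throws IOException
    int totalMemStoreSizeMB(String tableName);  // summed over the table's regions
}

class SafeguardRunner {
    /** Safeguard 1: true if any server died between the before/after
     *  snapshots, in which case the (idempotent) job is retried. */
    static boolean newDeadServers(Collection<String> before, Collection<String> after) {
        Set<String> diff = new HashSet<String>(after);
        diff.removeAll(before);
        return !diff.isEmpty();
    }

    /** Safeguard 2: force a flush, then poll until every region of the
     *  table reports memStoreSizeMB == 0 or the deadline passes.
     *  Returns false on timeout, which fails the job. */
    static boolean flushAndWait(ClusterView cluster, String table,
                                long timeoutMillis, long pollMillis) {
        cluster.flush(table);
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (cluster.totalMemStoreSizeMB(table) == 0) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```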
We feel as though these two mechanisms give us enough protection against
losing data from region server loss, since the hadoop job is the only
process saving to the tables.
I use this same technique on another, smaller job as well, and that one has
worked fine. However, on this larger job I am seeing an exception that names
a specific region when trying to call the initial flush(). We have re-run
the job a few times now, and each time this has happened, with a different
region. Searching for that region in the Admin UI confirms it doesn't exist.
Trying to flush the table manually with the hbase shell shows the same
problem. However, I also tried running the flush from the shell maybe
10-20 minutes after the job finished and it worked that time.
Is it possible that a split or compaction was happening at the time of the
flush, causing the region to become temporarily unavailable? Any thoughts?