RE: Heavy Writes Block Reads to RegionServer
Buckley,Ron 2012-05-16, 00:02
We ran into that same condition here last week doing pretty much the
same thing. Maybe you're hitting it too.
We found that the region server wasn't blocked all the time, but when it
was blocked there was an associated log message ("INFO
org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC
Server handler 1 on 9009' on region
77f.: memstore size 256.1m is >= than blocking 256.0m size") in our
logs. We had the same IPC Server Info that you described too.
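If you want to check your own region server logs for the same condition, something like the following might help. It is only a rough sketch: the pattern is modeled on the single log line quoted above, other HBase versions may word the message differently, and the names (BLOCKING_RE, find_blocked_updates, SAMPLE) are made up for illustration.

```python
import re

# Pattern modeled on the one log line quoted in this thread (assumption:
# your HBase version uses the same wording).
BLOCKING_RE = re.compile(
    r"Blocking updates for '(?P<handler>[^']+)' on region (?P<region>\S+?): "
    r"memstore size (?P<size>[\d.]+)m is >= than blocking (?P<limit>[\d.]+)m size"
)

def find_blocked_updates(lines):
    """Return (handler, region, size_mb, limit_mb) for each matching log line."""
    hits = []
    for line in lines:
        m = BLOCKING_RE.search(line)
        if m:
            hits.append(
                (m["handler"], m["region"], float(m["size"]), float(m["limit"]))
            )
    return hits

# The message from this thread, joined onto one line (region name is
# truncated in the original mail and left that way here).
SAMPLE = (
    "INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for "
    "'IPC Server handler 1 on 9009' on region 77f.: memstore size 256.1m "
    "is >= than blocking 256.0m size"
)

if __name__ == "__main__":
    for hit in find_blocked_updates([SAMPLE, "some unrelated log line"]):
        print(hit)
```

Feeding it the lines of a region server log and counting hits per region should show whether the blocking lines up with the periods when reads stall.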
It turned out, when that region would block, that the mappers were
taking all the rpc listener slots into the region server (visible from
the region server directly in the "Show Active RPC Calls"). Since the
mappers had all the slots, our gets for other tables would wait just to
get into the region server.
The rpc handler count is configurable via the
hbase.regionserver.handler.count property in hbase-site.xml.
We upped our value for that from its default of 10 to 50 (more than the
number of mappers we were running) and the problem went away.
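To make the effect concrete, here is a toy model of what we think was happening. It is not HBase code: a thread pool stands in for the rpc handlers, an Event stands in for the memstore flush that unblocks the writers, and the figure of 16 mappers is a made-up number (all we can say is that ours was below 50). The function name read_gets_through is hypothetical as well.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def read_gets_through(handler_count, blocked_writers, timeout=0.5):
    """Return True if a read completes while the writers are still blocked."""
    release = threading.Event()  # stands in for the flush that unblocks writes
    pool = ThreadPoolExecutor(max_workers=handler_count)
    try:
        # Each blocked writer holds a handler until the "flush" finishes.
        for _ in range(blocked_writers):
            pool.submit(release.wait)
        # A read against some other table arrives afterwards.
        read = pool.submit(lambda: "row")
        try:
            read.result(timeout=timeout)
            return True
        except FutureTimeout:
            return False  # the read never got a handler
    finally:
        release.set()  # let the writers finish so the pool can shut down
        pool.shutdown(wait=True)

if __name__ == "__main__":
    mappers = 16  # hypothetical; the real count is not given in this thread
    print("10 handlers:", read_gets_through(10, mappers))
    print("50 handlers:", read_gets_through(50, mappers))
```

With fewer handlers than blocked writers the read just sits in the queue, which matches what we saw. Once the handler count exceeds the number of writers there is always a free slot for it.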
I'm sure there's an art to setting that value; we think 50 will work
well for us. YMMV.
From: Bryan Beaudreault [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 15, 2012 5:39 PM
To: [EMAIL PROTECTED]
Subject: Heavy Writes Block Reads to RegionServer
We are running a job that does heavy writes into a new table. The table
is not pre-split so it has 1 region. I know this is not recommended; we
are doing it partially to test this particular case.
Here's what we're seeing:
1. Reads are entirely blocked. No reads to any region on that server
make it through.
2. Writes are insanely slow. Some writes appear to be taking over 10
3. All of the box's resources are quiet: under 20% CPU usage, plenty
of memory to spare, iostat looked normal
4. ngrep showed only writes coming through. no reads
5. The logs showed lots of WARN org.apache.hadoop.ipc.HBaseServer:
Server Responder, call
10.211.117.161:34380: output error
Any ideas what's up? Is there some sort of global lock that might halt
reads during heavy writes? Anything else we can look for during this?
We can rerun the job to reproduce this, as this is a test cluster which can
afford to be brought down.