Trying to make a case for making the block eviction strategy smarter: evict
remote blocks less frequently, and make the block requests smarter.
The question comes after I debugged an issue I was having with random
region servers hitting high load averages. I initially thought the problem
was hardware related, i.e. a bad disk or network, since the I/O wait was
too high, but it was a combination of things.
I figured that with SCR (short circuit read) ON, the datanode should almost
never show a high number of block requests from the local regionservers. So my
starting point for debugging was the datanode since it was doing a ton of
I/O. The clienttrace logs helped me figure out which RS nodes were making
block requests. I hacked up a script to report which blocks are being
requested and how many times per minute. I found that some blocks were
being requested 10+ times in a minute and over 2000 times in an hour from
the same regionserver. This was causing the server to do 40+MB/s on reads
alone. That was on the higher side, the average was closer to 100 or so per
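
For anyone who wants to repeat this, here is a minimal sketch of that kind of script. It is not the one I actually used. It assumes the datanode clienttrace lines look like `... clienttrace: src: /ip:port, dest: /ip:port, bytes: ..., op: HDFS_READ, ..., blockid: blk_..., ...`, and that for reads `dest` is the requesting client. The regex, the function names and the threshold are illustrative, so adjust them to your Hadoop version.

```python
import re
from collections import Counter

# Assumed clienttrace layout (hypothetical regex, adjust for your log format).
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}"      # timestamp, cut to the minute
    r".*clienttrace: src: /[\d.]+:\d+, "                     # datanode serving the read
    r"dest: /(?P<client>[\d.]+):\d+,"                        # client (regionserver) address
    r".*\bop: HDFS_READ,"                                    # only count reads
    r".*blockid: \S*?(?P<block>blk_-?\d+)"                   # block id without generation stamp
)


def count_block_reads(lines):
    """Count read requests per (minute, client, block)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group("minute"), m.group("client"), m.group("block"))] += 1
    return counts


def hot_blocks(counts, threshold=10):
    """Return entries requested at least `threshold` times, busiest first."""
    hot = [(key, n) for key, n in counts.items() if n >= threshold]
    return sorted(hot, key=lambda item: -item[1])
```

Feeding it an open log file, e.g. `hot_blocks(count_block_reads(open(path)))` with `path` pointing at the datanode log, lists the blocks that a single client requested repeatedly within the same minute.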
Now why did I end up in such a situation? It happened because I added
servers to the cluster and rebalanced it. At the same time I added some
drives and also removed the offending server from my setup. This caused
some of the data to no longer be co-located with the regionservers. Given
that major_compaction was disabled and would not have run for a while
(at least on some tables), these block requests would not go away. One of my
regionservers was totally overwhelmed. I made the situation worse when I
removed the server that was under heavy load with the assumption that it's
a hardware problem with the box without doing a deep dive (doh!). Given
that regionservers will be added in the future, I expect block locality to
go down till major_compaction runs. Also nodes can go down and cause this
problem. So I started thinking of probable solutions, but first some
- The surprising part was the regionservers were trying to make so many
requests for the same block in the same minute (let alone hour). Could this
happen because the original request took a few seconds and so the
regionserver re-requested ? I didn't see any fetch errors in the
regionserver logs for blocks.
- Even more strange: my heap size was 11G, but at the time this was
happening the used heap was only 2-4G. I would have expected the heap to
grow higher than that, since the blockCache should be using at least 40% of
the available heap space (roughly 4.4G).
- Another strange thing I observed was that the block was being requested
from the same datanode every single time.
- Would it make sense to give remote blocks higher priority over the local
blocks that can be read via SCR and not let them get evicted if there is a
tie in which block to evict ?
- Should we throttle the number of outgoing requests for a block ? I am not
sure if my firewall caused some issue but I wouldn't expect multiple block
fetch requests in the same minute. I did see a few RST packets getting
dropped at the firewall but I wasn't able to trace the problem was due to
- We have 3 replicas available, shouldn't we request from the other
datanode if one might take a lot of time ? The amount of time it took to
read a block went up when the box was under heavy load, yet the re-requests
were going to the same one. Is this something that is available on the
DFSClient and can we exploit it ?
- Is it possible to migrate a region to a server which has a higher number
of that region's blocks available locally ? We don't need to make this
automatic, but we could provide a command that could be invoked manually to
assign a region to a specific regionserver. Thoughts ?
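
To make the eviction suggestion above (prefer keeping remote blocks on a tie) a bit more concrete, here is a rough sketch of the tie-break. This is not how HBase's block cache is implemented. `CachedBlock`, `last_access`, `is_remote` and `pick_victim` are hypothetical names, and I am assuming access times are tracked coarsely enough that ties actually occur.

```python
from dataclasses import dataclass


@dataclass
class CachedBlock:
    block_id: str
    last_access: int  # coarse access tick; equal ticks count as a tie
    is_remote: bool   # True if the block had to be fetched from a remote datanode


def pick_victim(blocks):
    """Pick the block to evict.

    The least recently used block goes first. When two blocks tie on
    last_access, the local one is evicted, since it is cheap to re-read
    via SCR, and the remote one stays cached.
    """
    # False sorts before True, so among ties a local block is the minimum.
    return min(blocks, key=lambda b: (b.last_access, b.is_remote))


# Example: "a" and "b" tie on recency, so the local block "b" is evicted.
cache = [
    CachedBlock("a", last_access=5, is_remote=True),
    CachedBlock("b", last_access=5, is_remote=False),
    CachedBlock("c", last_access=9, is_remote=False),
]
victim = pick_victim(cache)
```

Recency still wins outright: a remote block that is strictly older than every local block would still be evicted first, so this only changes behaviour on ties.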