Today I needed to restart one of my region servers, and did so without gracefully shutting down the datanode. For the next 1-2 minutes we had a bunch of failed queries from various other region servers trying to access that datanode. Looking at the logs, I saw that they were all socket timeouts after 60000 milliseconds.
We use HBase mostly as an online datastore, with several APIs powering web apps and external consumers. Some writes come through the API, but we also have continuous Hadoop jobs feeding data in.
Since we have web app consumers, this 60 second timeout seems unreasonably long. If a datanode goes down, ideally the impact would last much less than that. I want to lower dfs.socket.timeout to something like 5-10 seconds, but I do not know the implications of doing so.
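For concreteness, this is roughly the change I have in mind. It is only a sketch: the 10000 ms value is an example I have not tested, and I am assuming the property has to be visible to the region servers (for example via the hdfs-site.xml on their classpath), since they are the DFS clients hitting the timeout.

```xml
<!-- hdfs-site.xml fragment; 10000 ms is an illustrative value, not a tested recommendation -->
<property>
  <name>dfs.socket.timeout</name>
  <!-- value is in milliseconds; the timeouts in my logs were 60000 -->
  <value>10000</value>
</property>
```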
Searching around, I did not find much precedent for this, though I did find people talking about raising the timeout well above 60 seconds. Is it generally safe to lower this timeout dramatically if you want faster failures? Are there any downsides to this?