Re: Client receives SocketTimeoutException (CallerDisconnected on RS)
Adrien Mogenet 2012-08-24, 15:56
G'd evening everyone,
Here are the logs from the server side : http://pastebin.com/yC5dGChh
And from the client side : http://pastebin.com/tR7wdkxG
I followed your advice and noticed the following:
First, I thought the "bad" RS were the ones with the highest number of
sockets, but rebooting each of them did not change anything. Then I
checked the GC log again and found nothing special there.
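For reference, the socket check suggested earlier in the thread (looking for connections stuck in CLOSE_WAIT with lsof) could be tallied with something like the sketch below. It is only an illustration: the class name `SocketStateTally` and the `tally` helper are made up for this example, and it assumes the input is `lsof -i` style output, where the TCP state is printed in parentheses at the end of each line.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SocketStateTally {
    // Assumption: each TCP line ends with its state in parentheses, e.g. "(CLOSE_WAIT)".
    private static final Pattern STATE = Pattern.compile("\\(([A-Z_0-9]+)\\)\\s*$");

    // Counts how many input lines end with each TCP state; lines without a state are skipped.
    static Map<String, Integer> tally(List<String> lsofLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lsofLines) {
            Matcher m = STATE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Reads lsof output from stdin and prints one "STATE count" pair per line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            List<String> lines = in.lines().collect(Collectors.toList());
            tally(lines).forEach((state, n) -> System.out.println(state + " " + n));
        }
    }
}
```

It could be fed with something like `lsof -nP -a -p <RS pid> -i TCP | java SocketStateTally`; a CLOSE_WAIT count that keeps growing on one RegionServer would support the out-of-sockets explanation.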
Finally, I noticed something strange: according to the web UI, my
data is roughly evenly distributed among regions/RS (pre-split table
with 1 region per RS). I mean that the displayed "number of requests"
is the same everywhere. But when I look at the "blocking RS", the
region it's trying to access has 178 StoreFiles in it, for a total of
160 GB, whereas the other servers are handling very little data.
My rowkey is a simple MD5, pre-split from "000... (x32)" to "FFF...
(x32)". Maybe it's another topic, but I feel it could result in
slow response times, even if the index size is small enough to fit in
memory (~120 MB / 12 GB of allocated heap).
What do you think about that hypothesis?
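One way to sanity-check this hypothesis offline is to hash a batch of keys and count how many fall into each pre-split range. The sketch below is a minimal, JDK-only illustration and does not touch HBase: the class `Md5RegionSketch`, its helper names, the `key-N` inputs and the 10-region layout are assumptions made for the example. It also assumes the row keys use the same upper-case hex as the split points; HBase orders row keys as raw bytes, so lower-case hex digits would sort after every upper-case split point.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5RegionSketch {
    // 32-character upper-case hex MD5 of a logical key (assumed row key format).
    static String md5Hex(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            return String.format("%032X", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Evenly spaced split points over the 128-bit key space; N regions need N - 1 splits.
    static String[] splitPoints(int numRegions) {
        BigInteger keySpace = BigInteger.ONE.shiftLeft(128);
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            BigInteger p = keySpace.multiply(BigInteger.valueOf(i))
                                   .divide(BigInteger.valueOf(numRegions));
            splits[i - 1] = String.format("%032X", p);
        }
        return splits;
    }

    // Region index = number of split points <= rowKey (ASCII string order matches byte order).
    static int regionIndex(String rowKey, String[] splits) {
        int idx = 0;
        while (idx < splits.length && rowKey.compareTo(splits[idx]) >= 0) {
            idx++;
        }
        return idx;
    }

    // Hashes numKeys hypothetical keys and counts how many land in each region.
    static int[] countPerRegion(int numKeys, int numRegions) {
        String[] splits = splitPoints(numRegions);
        int[] counts = new int[numRegions];
        for (int i = 0; i < numKeys; i++) {
            counts[regionIndex(md5Hex("key-" + i), splits)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countPerRegion(100_000, 10);
        for (int r = 0; r < counts.length; r++) {
            System.out.println("region " + r + ": " + counts[r]);
        }
    }
}
```

If the counts come out even here while one region on the cluster holds most of the data, the skew is more likely to come from how the keys or split points are actually encoded than from MD5 itself.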
On Thu, Aug 23, 2012 at 8:02 PM, Adrien Mogenet
<[EMAIL PROTECTED]> wrote:
> Hi guys,
> 1/ I quickly checked the GC logs and saw nothing. Since I need very
> fast lookup I set the zookeeper.session.timeout parameter to 10s to
> consider the RS as dead after very short pauses, and that did not
> 2/ I did not check but I don't think I ran out of sockets since the
> ulimit has been set very high, but I'll check !
> 3/ The benchmark can launch several R/W threads, but even the simplest
> program leads to my issue:
> Configuration config = HBaseConfiguration.create();
> HTable table = new HTable(config, "test");
> for (<1, 10, 100 or 1000>)
> getsList.add(new Get(<randomKey>));
> 4/ I will share more logs tomorrow to dig deeper, I personally need a
> long STW-pause :-)
> On Thu, Aug 23, 2012 at 7:49 PM, N Keywal <[EMAIL PROTECTED]> wrote:
>> Hi Adrien,
>> As well, if you can share the client code (number of threads, regions,
>> is it a set of single get, or are they multi gets, this kind of
>> On Thu, Aug 23, 2012 at 7:40 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
>>> Hi Adrien,
>>> I would love to see the region server side of the logs while those
>>> socket timeouts happen, also check the GC log, but one thing people
>>> often hit while doing pure random read workloads with tons of clients
>>> is running out of sockets because they are all stuck in CLOSE_WAIT.
>>> You can check that by using lsof. There are other discussions on this
>>> mailing list about it.
>>> On Thu, Aug 23, 2012 at 10:24 AM, Adrien Mogenet
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi there,
>>>> While I'm performing read-intensive benchmarks, I'm seeing a storm of
>>>> "CallerDisconnectedException" in certain RegionServers. As the
>>>> documentation says, my client received a SocketTimeoutException
>>>> (60000ms etc...) at the same time.
>>>> It's always happening and I get very poor read performance (from 10
>>>> to 5000 reads/s) in a 10-node cluster.
>>>> My benchmark consists of several iterations launching 10, 100 and 1000
>>>> Get requests on a given random rowkey with a single CF/qualifier.
>>>> I'm using HBase 0.94.1 (a few commits before the official stable
>>>> release) with Hadoop 1.0.3.
>>>> Bloom filters have been enabled (at the rowkey level).
>>>> I cannot find very clear information about these exceptions. From the
>>>> reference guide :
>>>> (...) you should consider digging in a bit more if you aren't doing
>>>> something to trigger them.
>>>> Well... could you help me dig? :-)