HBase user mailing list: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1


Guillaume Perrot 2012-11-15, 13:21
Stack 2012-11-16, 00:56
Vincent Barat 2012-11-16, 16:14

Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Vincent:
What's the value for hbase.regionserver.handler.count ?

I assume you kept the same value as in 0.90.3

Thanks
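For reference, a minimal sketch of how the resolved value can be checked from a client JVM, assuming the HBase jars and the cluster's hbase-site.xml are on the classpath (the servers use whatever their own hbase-site.xml says; 10 was the shipped default in the 0.90/0.92 era):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class HandlerCountCheck {
    public static void main(String[] args) {
      // Reads hbase-default.xml and hbase-site.xml from the classpath.
      Configuration conf = HBaseConfiguration.create();
      // The second argument is only the fallback used if the property is unset.
      int handlers = conf.getInt("hbase.regionserver.handler.count", 10);
      System.out.println("hbase.regionserver.handler.count = " + handlers);
    }
  }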

On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:

> On 16/11/12 01:56, Stack wrote:
>
>  On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[EMAIL PROTECTED]>
>> wrote:
>>
>>> It happens when several tables are being compacted and/or when there are
>>> several scanners running.
>>>
>>
>> It happens for a particular region?  Anything you can tell about the
>> server looking in your cluster monitoring?  Is it running hot?  What
>> do the hbase regionserver stats in UI say?  Anything interesting about
>> compaction queues or requests?
>>
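In case it helps with checking the compaction queues outside the UI, a rough sketch that reads the region server metrics over JMX; the JMX host, port, and the MBean/attribute names below are assumptions (0.92-era RegionServerStatistics metrics), so verify them with jconsole first:

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class CompactionQueueProbe {
    public static void main(String[] args) throws Exception {
      // The region server must be started with remote JMX enabled;
      // host and port below are placeholders.
      JMXServiceURL url = new JMXServiceURL(
          "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
      JMXConnector connector = JMXConnectorFactory.connect(url);
      try {
        MBeanServerConnection mbsc = connector.getMBeanServerConnection();
        // MBean and attribute names assumed for the 0.92-era metrics;
        // check what jconsole actually shows on your servers.
        ObjectName name = new ObjectName(
            "hadoop:service=RegionServer,name=RegionServerStatistics");
        Object queueSize = mbsc.getAttribute(name, "compactionQueueSize");
        System.out.println("compactionQueueSize = " + queueSize);
      } finally {
        connector.close();
      }
    }
  }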
>
> Hi, thanks for your answer, Stack. I will take the lead on this thread
> from now on.
>
> It does not happen on any particular region. Actually, things have gotten
> better now that compactions have been performed on all tables and have been
> stopped.
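As an aside, a rough sketch (not from this thread) of driving major compactions by hand rather than letting them pile up: the periodic major compactions can be disabled by setting hbase.hregion.majorcompaction to 0 in the region servers' hbase-site.xml, and then triggered explicitly when convenient. The table name below is made up.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class ManualMajorCompaction {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      try {
        // Queues a major compaction for every region of the table;
        // it runs asynchronously on the region servers.
        admin.majorCompact("my_table"); // made-up table name
      } finally {
        admin.close();
      }
    }
  }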
>
> Nevertheless, we face a dramatic decrease in performance (especially on
> random gets) across the overall cluster:
>
> Despite the fact that we doubled our number of region servers (from 8 to 16),
> and despite the fact that these region servers' CPU load is only about 10% to
> 30%, performance is really bad: very often a slight increase in requests
> leads to clients locked on a request, with very long response times. It looks
> like a contention / deadlock somewhere in the HBase client and C code.
>
>
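As a side note on the timeouts themselves, a minimal sketch of the standard client-side knobs that control how long a get or put waits before a SocketTimeoutException and how often it retries; the property names are the usual 0.92-era client settings, while the table name and values here are made up for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimeoutTuningExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.rpc.timeout", 120000);       // per-RPC timeout in ms
      conf.setInt("hbase.client.retries.number", 5);  // fewer retries -> faster failure
      conf.setLong("hbase.client.pause", 1000);       // back-off between retries in ms

      HTable table = new HTable(conf, "my_table");    // hypothetical table name
      try {
        Result r = table.get(new Get(Bytes.toBytes("some-row")));
        System.out.println("got " + r.size() + " cells");
      } finally {
        table.close();
      }
    }
  }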
>
>> If you look at the thread dump, are all handlers occupied serving
>> requests?  These timed-out requests couldn't get into the server?
>>
> We will investigate that and report back to you.
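One way to answer that question is to take a jstack of the region server while the timeouts are happening and count the handler threads by state. A rough sketch of a throw-away helper (hypothetical, assuming standard jstack output where the "java.lang.Thread.State" line follows each thread header):

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.List;

  public class HandlerDumpSummary {
    public static void main(String[] args) throws IOException {
      // args[0]: path to a file containing the output of `jstack <regionserver pid>`
      List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
      int total = 0, runnable = 0;
      for (int i = 0; i < lines.size(); i++) {
        if (lines.get(i).contains("IPC Server handler")) {
          total++;
          // In jstack output the thread state line follows the header line.
          if (i + 1 < lines.size() && lines.get(i + 1).contains("RUNNABLE")) {
            runnable++;
          }
        }
      }
      System.out.println(runnable + " of " + total + " IPC handler threads are RUNNABLE");
    }
  }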
>
>
>  Before the timeouts, we observe an increasing CPU load on a single region
>>> server and if we add region servers and wait for rebalancing, we always
>>> have the same region server causing problems like these:
>>>
>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>>   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>>>
>>> With the same access patterns, we did not have this issue in HBase
>>> 0.90.3.
>>>
>>
>> The above is the other side of the timeout -- the client is gone.
>>
>> Can you explain the rising CPU?
>>
> No, there is no explanation (no high access to a given region, for example).
> But this specific problem went away once we finished compactions.
>
>
>> Is it iowait on this box because of
>> compactions?  Bad disk?  Always same regionserver or issue moves
>> around?
>>
>> Sorry for all the questions.  0.92 should be better than 0.90
>>
> Our experience is currently the exact opposite: for us, 0.92 seems to be
> several times slower than 0.90.3.
>
>  generally (0.94 even better still -- can you go there?).
>>
>
> We can go to 0.94 but unfortunately, we CANNOT GO BACK (the same way we
> cannot go back to 0.90.3, since there is apparently a modification of the
> format of the ROOT table).
> The upgrade works, but the downgrade does not. And we are afraid of having
> even more "new" problems with 0.94 and being forced to roll back to 0.90.3
> (with
Vincent Barat 2012-11-16, 17:20
Vincent Barat 2012-11-16, 19:55