-
Thrift "hang ups" with no apparent reason
Galed Friedmann 2012-01-30, 14:39
Hi,
I have an HBase cluster that consists of 1 master server (running NameNode, ZooKeeper and HBase Master) and 3 region server nodes (each running DataNode and Region Server). I also have a Thrift server running on the master. I have some Hadoop MR jobs running on a separate Hadoop cluster (using JRuby) and some other processes that use Thrift as the endpoint to HBase. All of this runs on EC2.

Lately we've been having weird issues with Thrift: after several hours the Thrift server "hangs". The scripts that use it to access HBase get connection timeouts, and our Heroku and Ruby on Rails apps that use Thrift simply get stuck. Only restarting the Thrift process brings everything back to normal.

I've tried tweaking everything I could. Increasing the heap size of the Thrift process (to 4GB) only delayed the hang-ups (from around 4-5 hours to 9-10 hours) but did not fix the problem. ZooKeeper and HBase Master also have a 4GB heap.

The Thrift log files show nothing: the only entries are the connection establishment when I brought Thrift up (a few hours before the hang-ups) and then again when I restarted it.
Looking at the different log files this is what I see during the time the hangups start:

ZooKeeper log at the time of the hangups, looking at the Thrift process session IDs (0x1352a393d180008 and 0x1352a393d180009):

2012-01-30 10:51:36,721 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x1352a393d180008, likely client has closed socket
2012-01-30 10:51:36,721 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.217.55.193:53475 which had sessionid 0x1352a393d180008
2012-01-30 10:51:36,721 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x1352a393d180009, likely client has closed socket
2012-01-30 10:51:36,722 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.217.55.193:53477 which had sessionid 0x1352a393d180009
2012-01-30 10:52:00,001 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x1352a393d18051c, timeout of 90000ms exceeded
2012-01-30 10:52:00,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1352a393d18051c
2012-01-30 10:52:06,040 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /10.217.55.193:35937
2012-01-30 10:52:06,043 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.217.55.193:35937
2012-01-30 10:52:06,044 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x1352a393d18051d with negotiated timeout 90000 for client /10.217.55.193:35937
2012-01-30 10:52:08,820 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /10.217.55.193:35940
2012-01-30 10:52:08,821 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.217.55.193:35940
2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x1352a393d18051e with negotiated timeout 90000 for client /10.217.55.193:35940
2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1352a393d18051b
2012-01-30 10:52:50,844 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /10.64.165.124:47983
2012-01-30 10:52:50,856 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /10.64.165.124:47983
2012-01-30 10:52:50,858 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x1352a393d18051f with negotiated timeout 90000 for client /10.64.165.124:47983
2012-01-30 10:52:54,243 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x1352a393d18051f, likely client has closed socket
2012-01-30 10:52:54,244 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.64.165.124:47983 which had sessionid 0x1352a393d18051f
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x1352a393d180009, timeout of 90000ms exceeded
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session 0x1352a393d180008, timeout of 90000ms exceeded
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1352a393d180009
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1352a393d180008

In addition to that, on one of the Region Servers I found this exception at the time of the hangup:

2012-01-30 10:46:23,854 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8801271291968240625 lease expired
2012-01-30 10:46:23,854 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 4523402662192609713 lease expired
2012-01-30 10:46:23,854 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -3235593536276390176 lease expired
2012-01-30 10:46:35,034 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -8329379051383952775 lease expired
2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
        at sun.nio.ch.IOU
-
Re: Thrift "hang ups" with no apparent reason
Stack 2012-01-31, 20:51
On Mon, Jan 30, 2012 at 6:39 AM, Galed Friedmann <[EMAIL PROTECTED]> wrote:
> Lately we're having weird issues with Thrift, after several hours the
> Thrift server "hangs" - the scripts that are using it to access HBase get
> connection timeouts, we're also using Heroku and ruby on rails apps that
> use Thrift and they simply get stuck. Only when restarting the Thrift
> process everything goes back to normal.

Can you thread dump the Thrift server when it's all hung up?

Have you enabled

> 2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Established session 0x1352a393d18051e with negotiated timeout 90000 for
> client /10.217.55.193:35940
> 2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
> Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
> 2012-01-30 10:52:28,001 INFO
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x1352a393d18051b

ZK is establishing a session w/ a 90-second timeout and then timing out immediately?

> 2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> listener on 60020: readAndProcess threw exception java.io.IOException:
> Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>         at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> 2012-01-30 10:52:24,016 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -4511393305838866925 lease expired
> 2012-01-30 10:52:24,016 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -5818959718437063034 lease expired
> 2012-01-30 10:52:24,016 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -1408921590864341720 lease expired

Client went away? Do the lease expirations happen all the time or just around the time of the hang-up? (You are closing scanners when done?)

St.Ack
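
One way to check the "established and then timing out immediately" question is to group the ZooKeeper log lines by session id. What follows is only a minimal sketch in Python, not something posted in the thread: the function name events_by_session is made up for illustration, and the sample lines are the three quoted in this message. On those lines, the session being established (0x1352a393d18051e) and the session being expired (0x1352a393d18051b) have different ids, which suggests the expiry belongs to an older session rather than the one just created.

```python
import re
from collections import defaultdict

# Matches session ids such as 0x1352a393d18051e in ZooKeeper server log lines.
SESSION_ID = re.compile(r"0x[0-9a-f]+")


def events_by_session(lines):
    """Group ZooKeeper server log lines by the session id they mention.

    Hypothetical helper for illustration; it assumes each line starts with a
    23-character "YYYY-MM-DD HH:MM:SS,mmm" timestamp, as in the excerpt above.
    """
    grouped = defaultdict(list)
    for line in lines:
        match = SESSION_ID.search(line)
        if match is None:
            continue  # e.g. "Accepted socket connection" lines carry no session id
        timestamp = line[:23]
        # Everything after the first ": " is the message that follows the class name.
        _, _, message = line.partition(": ")
        grouped[match.group(0)].append((timestamp, message))
    return dict(grouped)


# Sample lines copied from the ZooKeeper log excerpt in this thread.
sample = [
    "2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn: "
    "Established session 0x1352a393d18051e with negotiated timeout 90000 for "
    "client /10.217.55.193:35940",
    "2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer: "
    "Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded",
    "2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: "
    "Processed session termination for sessionid: 0x1352a393d18051b",
]

for session, events in sorted(events_by_session(sample).items()):
    print(session)
    for timestamp, message in events:
        print("   ", timestamp, message)
```

Run against the full ZooKeeper log, the same grouping would show when each expiring session was originally established, which is the gap that matters for the 90000 ms timeout.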
-
Re: Thrift "hang ups" with no apparent reason
Galed Friedmann 2012-02-01, 09:00
Hi,
Thanks for replying! Answers to your questions:

1. I've taken a dump from the HMaster when we felt some timeouts. I hope that's what you're looking for; it's attached.
2. The timeout occurs around 10-12 hours after ZK established the connection with the Thrift server, so it's not immediate. The Thrift logs show that nothing happened; the timeouts only appear in the ZK logs. Actually, we haven't had errors or ZK timeouts for Thrift in the last 15 hours, but I'm sure it'll happen again.
3. The lease expiration happens all the time. We're using mostly JRuby scripts and closing the scans when we're done.

Thanks again,
Galed.
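
On the scanner question (point 3), one way to make sure a scanner is always closed, even when the client code fails midway, is to wrap it in a context manager. This is a hedged sketch in Python rather than the JRuby the poster uses. It assumes a generated HBase Thrift client exposing scannerOpen(tableName, startRow, columns), scannerGetList(id, nbRows) and scannerClose(id); the exact signatures depend on the HBase version. The names open_scanner, scan_rows and FakeClient are hypothetical, and FakeClient exists only so the sketch runs without a cluster.

```python
from contextlib import contextmanager


@contextmanager
def open_scanner(client, table, start_row, columns):
    """Open a scanner and guarantee scannerClose runs, even on errors."""
    scanner_id = client.scannerOpen(table, start_row, columns)
    try:
        yield scanner_id
    finally:
        # Releasing the scanner promptly avoids waiting for its lease to expire.
        client.scannerClose(scanner_id)


def scan_rows(client, table, start_row, columns, batch=100):
    """Read all rows in batches; an empty batch is treated as end of scan."""
    rows = []
    with open_scanner(client, table, start_row, columns) as scanner_id:
        while True:
            chunk = client.scannerGetList(scanner_id, batch)
            if not chunk:
                break
            rows.extend(chunk)
    return rows


class FakeClient:
    """Stand-in for a generated Thrift client (assumption, for illustration only)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.closed = []

    def scannerOpen(self, table, start_row, columns):
        return 1  # fixed scanner id for the example

    def scannerGetList(self, scanner_id, nb_rows):
        chunk, self.rows = self.rows[:nb_rows], self.rows[nb_rows:]
        return chunk

    def scannerClose(self, scanner_id):
        self.closed.append(scanner_id)


fake = FakeClient(["row1", "row2", "row3"])
print(scan_rows(fake, "mytable", "", ["cf:"], batch=2))
print("closed scanners:", fake.closed)
```

If scanners are already closed this way and "lease expired" still shows up constantly, the remaining explanations are the ones raised in the thread: a client that goes away mid-scan or one that is slow to come back for the next batch.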
-
Re: Thrift "hang ups" with no apparent reason
Stack 2012-02-01, 16:57
On Wed, Feb 1, 2012 at 1:00 AM, Galed Friedmann <[EMAIL PROTECTED]> wrote:
> 1. I've taken a dump from the HMaster when we felt some timeouts, I hope
> that's what you're looking for, attached.

I was looking for dumps of the hung-up Thrift server.

The master dump shows it idle.

> 2. The timeout occurs around 10-12 hours after the ZK established the
> connection with the Thrift server so it's not immediate. On the Thrift logs
> you see that nothing happened and only see the timeouts on the ZK logs.
> Actually we hadn't had errors in the last 15 hours nor ZK timeouts for
> Thrift but it'll happen again I'm sure..

OK. Thread dump it when it's hung up. Thrift is getting stuck going against the cluster, it seems. How many gateways are you running? Run more?

> 3. The lease expiration happens all the time, we're using mostly JRuby
> scripts and closing the scans when we're done.

Could it be the client is taking a long time to get back to the server? Or maybe the server is taking a long time to respond because it's heavily loaded (is it?).

St.Ack
-
Re: Thrift "hang ups" with no apparent reason
Galed Friedmann 2012-02-01, 17:08
Hi,
It doesn't look like the servers are loaded; we're not passing that much traffic through the cluster at the moment.

Can you explain how to take the dump from the Thrift server? I couldn't find how to do that.

At the moment we have only 1 Thrift gateway. I'm going to add some more with load balancing.

Thanks again.
-
Re: Thrift "hang ups" with no apparent reason
Stack 2012-02-01, 17:23
On Wed, Feb 1, 2012 at 9:08 AM, Galed Friedmann <[EMAIL PROTECTED]> wrote:
> Can you explain how to take the dump from the Thrift server? I couldn't
> find how to do that.

Try this: http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx

> At the moment we have only 1 Thrift gateway, I'm going to add some more
> with load balancing.

At a minimum, it might put off the hang.

St.Ack
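
For reference, a thread dump of a running JVM can also be taken with the JDK's own tools: "jps -l" lists JVM process ids with their main classes, and "jstack <pid>" prints the thread dump (sending SIGQUIT with "kill -QUIT <pid>" makes the JVM write the dump to its own stdout instead). The Python sketch below wraps those two commands. It assumes the JDK tools are on the PATH and that the Thrift gateway's main class name contains "ThriftServer"; the helper names and the sample jps output are made up for illustration.

```python
import subprocess


def find_pids(jps_output, main_class_fragment):
    """Return pids from `jps -l` output whose main class contains the fragment."""
    pids = []
    for line in jps_output.splitlines():
        parts = line.split(None, 1)  # "<pid> <main class or jar path>"
        if len(parts) == 2 and main_class_fragment in parts[1]:
            pids.append(int(parts[0]))
    return pids


def thread_dump_command(pid):
    """Build the jstack command line for the JVM with the given pid."""
    return ["jstack", str(pid)]


def take_thread_dump(pid, out_path):
    """Run jstack against the JVM and save the dump to out_path."""
    result = subprocess.run(
        thread_dump_command(pid), capture_output=True, text=True, check=True
    )
    with open(out_path, "w") as handle:
        handle.write(result.stdout)
    return out_path


# Made-up `jps -l` output, only to show the parsing step.
sample_jps = (
    "4242 org.apache.hadoop.hbase.thrift.ThriftServer\n"
    "4300 sun.tools.jps.Jps\n"
)
print(find_pids(sample_jps, "ThriftServer"))
```

On the gateway host, the real input would come from subprocess.run(["jps", "-l"], capture_output=True, text=True).stdout, followed by take_thread_dump(pid, "thrift.dump") while the server is hung. Taking two or three dumps a few seconds apart makes it easier to tell a stuck thread from a busy one.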
-
Re: Thrift "hang ups" with no apparent reason
Galed Friedmann 2012-02-02, 07:59
Hi again,
Moved one of the services to another Thrift gateway and still got timeouts from that service. Something tells me even load balancing won't help.

I'm attaching 3 dumps from the Thrift servers: thrift2.dump is from the additional server we brought up, and the other 2 files are from the Thrift server that is running on the HMaster.

Thanks again for the help and patience.
-
Re: Thrift "hang ups" with no apparent reason
yuzhihong@... 2012-02-02, 08:57
I don't think the attachments went through.
Can you find some other place to upload the files?

Thanks
-
Re: Thrift "hang ups" with no apparent reason
Galed Friedmann 2012-02-02, 09:33
Uploaded to pastebin:
http://pastebin.com/YAHLyEMV
http://pastebin.com/HuAhypsU
http://pastebin.com/BupMyENi

Thanks
-
Re: Thrift "hang ups" with no apparent reason
Jean-Daniel Cryans 2012-02-02, 19:27
It seems like the Thrift servers are doing "something": I see they are reading inputs from your application, and one is scanning.

Earlier you mentioned that setting bigger heaps only delayed the issue, so it seems there's a memory leak. Which HBase version are you using? Earlier 0.90 versions had some issues with connection handling, but we've been running Thrift in production for ages and haven't seen this issue (we migrated from the 0.89 series to 0.90.2, so maybe you are using something older?).

Did you enable GC logging? When it "gets stuck", I'm pretty sure the GC log is filled with multi-second Full GCs.

Finally, if you are using a recent version and do see heavy GCing, try to get a heap dump and see where the memory is allocated. "jvisualvm" can help you do that if you don't feel like paying for JProfiler.

Hope this helps,

J-D
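
To check the Full GC suspicion, GC logging can be turned on for the Thrift JVM with HotSpot flags such as -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<file>, and the resulting log scanned for long pauses. The Python sketch below is one rough way to do that scan. It assumes the common HotSpot layout in which a Full GC record contains the text "Full GC" and ends with ", N secs]"; the exact layout varies with JVM version and collector. The function name long_full_gcs and both sample lines are invented for illustration and are not taken from the poster's system.

```python
import re

# Duration that closes a collection record, e.g. ", 7.9012340 secs]".
PAUSE = re.compile(r", (\d+\.\d+) secs\]")


def long_full_gcs(lines, threshold_secs=1.0):
    """Return (pause_seconds, line) for Full GC records at or above the threshold."""
    hits = []
    for line in lines:
        if "Full GC" not in line:
            continue  # skip young-generation collections
        pauses = PAUSE.findall(line)
        if not pauses:
            continue  # record layout not recognised by this sketch
        # Nested records report inner durations first; the last one is the total.
        total = float(pauses[-1])
        if total >= threshold_secs:
            hits.append((total, line))
    return hits


# Invented sample records in a CMS-style layout (assumption, not real data).
sample_gc_log = [
    "10.123: [GC 10.123: [ParNew: 1000K->100K(2000K), 0.0123450 secs] "
    "3000K->2100K(8000K), 0.0124560 secs]",
    "36000.500: [Full GC 36000.500: [CMS: 4000000K->3990000K(4100000K), "
    "7.8900000 secs] 4100000K->3990000K(4190000K), "
    "[CMS Perm : 30000K->30000K(50000K)], 7.9012340 secs]",
]

for pause, record in long_full_gcs(sample_gc_log):
    print("%.2f s pause: %s" % (pause, record[:60]))
```

If the real log shows back-to-back Full GCs that each reclaim very little, that would be consistent with the memory-leak theory above, and a heap dump would be the next step.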