|
Srikanth P. Shreenivas
2011-08-18, 14:37
Stack
2011-08-18, 18:33
Srikanth P. Shreenivas
2011-08-18, 18:49
Srikanth P. Shreenivas
2011-08-18, 18:51
Srikanth P. Shreenivas
2011-08-18, 19:09
Shrijeet Paliwal
2011-08-18, 20:21
Srikanth P. Shreenivas
2011-08-19, 20:27
Srikanth P. Shreenivas
2011-08-20, 12:56
Jean-Daniel Cryans
2011-08-21, 21:02
Ramkrishna S Vasudevan
2011-08-22, 06:25
Srikanth P. Shreenivas
2011-08-22, 10:13
Srikanth P. Shreenivas
2011-08-22, 11:58
Srikanth P. Shreenivas
2011-08-22, 13:57
|
-
Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-18, 14:37
Hi,
We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at: java.io.IOException: Giving up trying to get region server: thread is interrupted. at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal. I tried looking up below functions, but at this point could not make much out of it. https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233 https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611. If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption. We have seen this issue in two of our HBase Clusters, where load is quite less. We have 20 reads per minute, we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop). We are using CDH3. Thanks, Srikanth ________________________________ http://www.mindtree.com/email/disclaimer.html
-
Re: Query regarding HTable.get and timeoutsStack 2011-08-18, 18:33
Is your client running inside a container of some form and could the
container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote: > Hi, > > We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at: > > java.io.IOException: Giving up trying to get region server: thread is interrupted. > at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016) > at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) > > > When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal. > > I tried looking up below functions, but at this point could not make much out of it. > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233 > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611. > > > If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption. > > We have seen this issue in two of our HBase Clusters, where load is quite less. We have 20 reads per minute, we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop). We are using CDH3. > > Thanks, > Srikanth > > ________________________________ > > http://www.mindtree.com/email/disclaimer.html >
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-18, 18:49
Hi Stack,
Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [[EMAIL PROTECTED]] Sent: Friday, August 19, 2011 12:03 AM To: [EMAIL PROTECTED] Subject: Re: Query regarding HTable.get and timeouts Is your client running inside a container of some form and could the container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote: > Hi, > > We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at: > > java.io.IOException: Giving up trying to get region server: thread is interrupted. > at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016) > at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) > > > When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal. > > I tried looking up below functions, but at this point could not make much out of it. > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233 > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611. > > > If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption. > > We have seen this issue in two of our HBase Clusters, where load is quite less. We have 20 reads per minute, we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop). We are using CDH3. > > Thanks, > Srikanth > > ________________________________ > > http://www.mindtree.com/email/disclaimer.html >
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-18, 18:51
Please note that line numbers I am referencing are from the file : https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:19 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Stack, Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [[EMAIL PROTECTED]] Sent: Friday, August 19, 2011 12:03 AM To: [EMAIL PROTECTED] Subject: Re: Query regarding HTable.get and timeouts Is your client running inside a container of some form and could the container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote: > Hi, > > We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at: > > java.io.IOException: Giving up trying to get region server: thread is interrupted. > at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016) > at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) > > > When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal. > > I tried looking up below functions, but at this point could not make much out of it. > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233 > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611. > > > If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption. > > We have seen this issue in two of our HBase Clusters, where load is quite less. We have 20 reads per minute, we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop). We are using CDH3. > > Thanks, > Srikanth > > ________________________________ > > http://www.mindtree.com/email/disclaimer.html >
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-18, 19:09
My apologies, I may not be reading the code right.
You are right, it is GridGain timeout that is making the line 1255 to execute. However, the question is what would make a HTable.get() to take close to 10 minutes to induce a timeout in GridGain task. The value of numRetries at line 1236 should be 10 (default) and if we go with default value of HConstants.RETRY_BACKOFF, then, sleep time added with all retries will be only 61 seconds, and not close to 600 seconds as the case in our code is. Regards, Srikanth ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:21 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Please note that line numbers I am referencing are from the file : https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:19 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Stack, Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [[EMAIL PROTECTED]] Sent: Friday, August 19, 2011 12:03 AM To: [EMAIL PROTECTED] Subject: Re: Query regarding HTable.get and timeouts Is your client running inside a container of some form and could the container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote: > Hi, > > We are experiencing an issue in our HBase Cluster wherein some of the gets are timing outs at: > > java.io.IOException: Giving up trying to get region server: thread is interrupted. > at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016) > at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) > > > When we look at the logs of master, zookeeper and region servers, there is nothing that indicates anything abnormal. > > I tried looking up below functions, but at this point could not make much out of it. > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233 > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611. > > > If you can please suggest what are the scenarios in which all retries can get exhausted resulting in thread interruption. > > We have seen this issue in two of our HBase Clusters, where load is quite less. We have 20 reads per minute, we run 1 zookeeper, and 4 regionservers in fully-distributed mode (Hadoop). We are using CDH3.
-
Re: Query regarding HTable.get and timeoutsShrijeet Paliwal 2011-08-18, 20:21
It follows exponential back off. Each pause is longer than the last one and
all adds up close to 600. On Thu, Aug 18, 2011 at 12:09 PM, Srikanth P. Shreenivas < [EMAIL PROTECTED]> wrote: > My apologies, I may not be reading the code right. > > You are right, it is GridGain timeout that is making the line 1255 to > execute. > However, the question is what would make a HTable.get() to take close to 10 > minutes to induce a timeout in GridGain task. > > The value of numRetries at line 1236 should be 10 (default) and if we go > with default value of HConstants.RETRY_BACKOFF, then, sleep time added with > all retries will be only 61 seconds, and not close to 600 seconds as the > case in our code is. > > > Regards, > Srikanth > > > ________________________________________ > From: Srikanth P. Shreenivas > Sent: Friday, August 19, 2011 12:21 AM > To: [EMAIL PROTECTED] > Subject: RE: Query regarding HTable.get and timeouts > > Please note that line numbers I am referencing are from the file : > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java > > > ________________________________________ > From: Srikanth P. Shreenivas > Sent: Friday, August 19, 2011 12:19 AM > To: [EMAIL PROTECTED] > Subject: RE: Query regarding HTable.get and timeouts > > Hi Stack, > > Thanks a lot for your reply. It's always a comforting feeling to see very > active community and especially your prompt replies to the queries. > > Yes, I am running it in as GridGain task, so it runs it GridGain's thread > pool. In this case, we can imaging GridGain as something that hands off > works to various worker threads and waits asynhronously for it complete. I > have 10 minute timeout after which GridGain would consider work as timed > out. > > What we are observing is that our tasks are timeing out at 10 minute > boundary, and delay seems to be caused by the part of the work which is > doing HTable.get. > > My suspicion is that Line 1255 in HConnectionManager.java is calling the > Thread.currentThread().interrupt(), due to which the GridGain thread kind of > stops doing what it was meant to do, and never responsds to master node > resulting in timeout in master. > > In order for line 1255 to execute, we will have to assume that all retries > were exhausted. > Hence, my query that what would cause a HTable.get() to get into a > situation wherein > HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets > to line 1255. > > > Regards, > Srikanth > > ________________________________________ > From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [ > [EMAIL PROTECTED]] > Sent: Friday, August 19, 2011 12:03 AM > To: [EMAIL PROTECTED] > Subject: Re: Query regarding HTable.get and timeouts > > Is your client running inside a container of some form and could the > container be doing the interrupting? I've not come across > client-side thread interrupts before. > St.Ack > > On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas > <[EMAIL PROTECTED]> wrote: > > Hi, > > > > We are experiencing an issue in our HBase Cluster wherein some of the > gets are timing outs at: > > > > java.io.IOException: Giving up trying to get region server: thread is > interrupted. > > at > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016) > > at > org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) > > > > > > When we look at the logs of master, zookeeper and region servers, there > is nothing that indicates anything abnormal. > > > > I tried looking up below functions, but at this point could not make much > out of it. > > > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233 > > > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java Htable.get starts at Line 611.
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-19, 20:27
I did some tests today. In our QA setup, we dont see any issues. I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected.
However, in our production setup, I regularly see the issue wherein client thread gets interrupted because HTable.get() call does not return. It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121. However, I am not able to figure out what is causing this. Logs of Hbase as well as Hadoop servers seems to be quite normal. The cluster on which we are seeing this issue has no writes happening, and I do see this issue after about 10 operations. One strange thing I noticed though is that zookeeper logs were truncated and had only entries for last 15-20 minutes instead of complete day. This was around 8PM, so it had not rolled over. I had earlier too asked a query on same issue (http://www.mail-archive.com/[EMAIL PROTECTED]/msg09904.html). The changes we did of moving to CDH3 and using -XX:UseMemBar fixed issues in our QA setup. However, same changes in production does not seem to have similar effect. If you can provide any clues on how should we go about investigating this issue, that will be real help. Regards, Srikanth. ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:39 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts My apologies, I may not be reading the code right. You are right, it is GridGain timeout that is making the line 1255 to execute. However, the question is what would make a HTable.get() to take close to 10 minutes to induce a timeout in GridGain task. The value of numRetries at line 1236 should be 10 (default) and if we go with default value of HConstants.RETRY_BACKOFF, then, sleep time added with all retries will be only 61 seconds, and not close to 600 seconds as the case in our code is. Regards, Srikanth ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:21 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Please note that line numbers I am referencing are from the file : https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:19 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Stack, Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [[EMAIL PROTECTED]] Sent: Friday, August 19, 2011 12:03 AM To: [EMAIL PROTECTED] Subject: Re: Query regarding HTable.get and timeouts Is your client running inside a container of some form and could the container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote:
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-20, 12:56
Further in this investigation, we enabled the debug logs on client side.
We are observing that client is trying to root region, and is continuously failing to do so. The logs are filled with entries like this: 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null Client keeps retrying and retries get exhausted. Complete logs are available here: https://gist.github.com/1159064 including logs of master, zookeeper and region servers. If you can please look at the logs and provide some inputs on this issue, then it will be really helpful. We are really not sure why client is failing to get root regions from the server. Any guidance will be greatly appreciated. Thanks a lot, Srikanth -----Original Message----- From: Srikanth P. Shreenivas Sent: Saturday, August 20, 2011 1:57 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts I did some tests today. In our QA setup, we dont see any issues. I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected. However, in our production setup, I regularly see the issue wherein client thread gets interrupted because HTable.get() call does not return. It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121. However, I am not able to figure out what is causing this. Logs of Hbase as well as Hadoop servers seems to be quite normal. The cluster on which we are seeing this issue has no writes happening, and I do see this issue after about 10 operations. One strange thing I noticed though is that zookeeper logs were truncated and had only entries for last 15-20 minutes instead of complete day. This was around 8PM, so it had not rolled over. I had earlier too asked a query on same issue (http://www.mail-archive.com/[EMAIL PROTECTED]/msg09904.html). The changes we did of moving to CDH3 and using -XX:UseMemBar fixed issues in our QA setup. However, same changes in production does not seem to have similar effect. If you can provide any clues on how should we go about investigating this issue, that will be real help. Regards, Srikanth. ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:39 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts My apologies, I may not be reading the code right. You are right, it is GridGain timeout that is making the line 1255 to execute. However, the question is what would make a HTable.get() to take close to 10 minutes to induce a timeout in GridGain task. The value of numRetries at line 1236 should be 10 (default) and if we go with default value of HConstants.RETRY_BACKOFF, then, sleep time added with all retries will be only 61 seconds, and not close to 600 seconds as the case in our code is. Regards, Srikanth ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:21 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Please note that line numbers I am referencing are from the file : https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:19 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Stack, Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [[EMAIL PROTECTED]] Sent: Friday, August 19, 2011 12:03 AM To: [EMAIL PROTECTED] Subject: Re: Query regarding HTable.get and timeouts Is your client running inside a container of some form and could the container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote:
-
Re: Query regarding HTable.get and timeoutsJean-Daniel Cryans 2011-08-21, 21:02
Yeah that null message isn't really helpful :)
So one thing that might be helpful would be to know who DC1AuthDFSC1D3 is, since you identified the logs as "Region server n". Then look at the master's web UI and see where -ROOT- is assigned. Is it also DC1AuthDFSC1D3? If so, then I would proceed by checking if there's a firewall in between the client and the cluster, also I would make sure that the client is running the same version as the server. J-D On Sat, Aug 20, 2011 at 5:56 AM, Srikanth P. Shreenivas <[EMAIL PROTECTED]> wrote: > Further in this investigation, we enabled the debug logs on client side. > > We are observing that client is trying to root region, and is continuously failing to do so. The logs are filled with entries like this: > > 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 > 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 > because: null > > Client keeps retrying and retries get exhausted. > > > Complete logs are available here: https://gist.github.com/1159064 including logs of master, zookeeper and region servers. > > > If you can please look at the logs and provide some inputs on this issue, then it will be really helpful. > We are really not sure why client is failing to get root regions from the server. Any guidance will be greatly appreciated. > > > Thanks a lot, > Srikanth
-
RE: Query regarding HTable.get and timeoutsRamkrishna S Vasudevan 2011-08-22, 06:25
Hi Srikanth
I went through your logs.. not much info is present in that. Could you give the logs which shows what happened when the master and RS started. and also to whom the ROOT and META table got assigned? Regards Ram -----Original Message----- From: Srikanth P. Shreenivas [mailto:[EMAIL PROTECTED]] Sent: Saturday, August 20, 2011 6:27 PM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Further in this investigation, we enabled the debug logs on client side. We are observing that client is trying to root region, and is continuously failing to do so. The logs are filled with entries like this: 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImpl ementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null Client keeps retrying and retries get exhausted. Complete logs are available here: https://gist.github.com/1159064 including logs of master, zookeeper and region servers. If you can please look at the logs and provide some inputs on this issue, then it will be really helpful. We are really not sure why client is failing to get root regions from the server. Any guidance will be greatly appreciated. Thanks a lot, Srikanth -----Original Message----- From: Srikanth P. Shreenivas Sent: Saturday, August 20, 2011 1:57 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts I did some tests today. In our QA setup, we dont see any issues. I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected. However, in our production setup, I regularly see the issue wherein client thread gets interrupted because HTable.get() call does not return. It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121. However, I am not able to figure out what is causing this. Logs of Hbase as well as Hadoop servers seems to be quite normal. The cluster on which we are seeing this issue has no writes happening, and I do see this issue after about 10 operations. One strange thing I noticed though is that zookeeper logs were truncated and had only entries for last 15-20 minutes instead of complete day. This was around 8PM, so it had not rolled over. I had earlier too asked a query on same issue (http://www.mail-archive.com/[EMAIL PROTECTED]/msg09904.html). The changes we did of moving to CDH3 and using -XX:UseMemBar fixed issues in our QA setup. However, same changes in production does not seem to have similar effect. If you can provide any clues on how should we go about investigating this issue, that will be real help. Regards, Srikanth. ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:39 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts My apologies, I may not be reading the code right. You are right, it is GridGain timeout that is making the line 1255 to execute. However, the question is what would make a HTable.get() to take close to 10 minutes to induce a timeout in GridGain task. The value of numRetries at line 1236 should be 10 (default) and if we go with default value of HConstants.RETRY_BACKOFF, then, sleep time added with all retries will be only 61 seconds, and not close to 600 seconds as the case in our code is. Regards, Srikanth ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:21 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Please note that line numbers I am referencing are from the file : https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/h base/client/HConnectionManager.java ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:19 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Stack, Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] on behalf of Stack [[EMAIL PROTECTED]] Sent: Friday, August 19, 2011 12:03 AM To: [EMAIL PROTECTED] Subject: Re: Query regarding HTable.get and timeouts Is your client running inside a container of some form and could the container be doing the interrupting? I've not come across client-side thread interrupts before. St.Ack
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-22, 10:13
Yes, DC1AuthDFSC1D3 hosts the root region. It is also region server 3. DC1AuthDFSC1D1, DC1AuthDFSC1D2, DC1AuthDFSC1D3 and DC1AuthDFSC1D4 are 4 region servers in our cluster.
****************************************** I checked with Data Centre team, they confirmed that there is no firewall in the network where hbase servers and client applications is running. ****************************************** Regarding client and server running different versions, they are running same versions. If there was version mismatch, I guess we would be seeing the issue for all the reads. Here we see the issue only for few reads, one in 10-15 reads fail this way. We do use same hbase, zookeeper and hadoop jars as found in the HBase distribution. Strangely enough, I saw the below for the first time today, and it has occurred only once so far. 10.3.48.61 is the IP address where our client app is running. 2011-08-22 11:46:55,905 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7625 got version 6 expected version 3 2011-08-22 11:46:57,542 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7626 got version 6 expected version 3 2011-08-22 11:46:58,483 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7627 got version 6 expected version 3 2011-08-22 11:46:59,335 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7628 got version 6 expected version 3 2011-08-22 11:47:00,164 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7629 got version 6 expected version 3 2011-08-22 11:47:00,972 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7630 got version 6 expected version 3 2011-08-22 11:47:01,768 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7631 got version 6 expected version 3 2011-08-22 11:47:02,648 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7632 got version 6 expected version 3 ****************************************** I enabled debug logging level for all classes today. Here is the exception associated with "null" messages. *** Do you think that some thread in client is doing interrupt() resulting in "java.nio.channels.ClosedByInterruptException" below? *** 2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null 2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - Connecting to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020 2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - closing ipc connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020: null java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy41.getClosestRowBefore(Unknown Source) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:719) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415) at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546) at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findEntities(HBaseHandler.java:271) at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findObject(HBaseHandler.java:156) at in.gov.uidai.platform.impl.persistence.provider.AbstractPers
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-22, 11:58
Hi,
Going with the assumption that our client threads may getting interrupted and it may not be an hbase issue, we rebuilt our client application without GridGain. Earlier our code was being executed by GridGain's thread pool, but now we made the app to run in raw Tomcat. I am very glad to say that we are not seeing any " java.nio.channels.ClosedByInterruptException". My client app is working just great, performing its hbase read/writes as expected. Thanks a lot for all the help. Regards, Srikanth PS: I must say that HBase community is really great. Really appreciate all the inputs and suggestions. -----Original Message----- From: Srikanth P. Shreenivas Sent: Monday, August 22, 2011 3:43 PM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Yes, DC1AuthDFSC1D3 hosts the root region. It is also region server 3. DC1AuthDFSC1D1, DC1AuthDFSC1D2, DC1AuthDFSC1D3 and DC1AuthDFSC1D4 are 4 region servers in our cluster. ****************************************** I checked with Data Centre team, they confirmed that there is no firewall in the network where hbase servers and client applications is running. ****************************************** Regarding client and server running different versions, they are running same versions. If there was version mismatch, I guess we would be seeing the issue for all the reads. Here we see the issue only for few reads, one in 10-15 reads fail this way. We do use same hbase, zookeeper and hadoop jars as found in the HBase distribution. Strangely enough, I saw the below for the first time today, and it has occurred only once so far. 10.3.48.61 is the IP address where our client app is running. 2011-08-22 11:46:55,905 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7625 got version 6 expected version 3 2011-08-22 11:46:57,542 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7626 got version 6 expected version 3 2011-08-22 11:46:58,483 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7627 got version 6 expected version 3 2011-08-22 11:46:59,335 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7628 got version 6 expected version 3 2011-08-22 11:47:00,164 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7629 got version 6 expected version 3 2011-08-22 11:47:00,972 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7630 got version 6 expected version 3 2011-08-22 11:47:01,768 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7631 got version 6 expected version 3 2011-08-22 11:47:02,648 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7632 got version 6 expected version 3 ****************************************** I enabled debug logging level for all classes today. Here is the exception associated with "null" messages. *** Do you think that some thread in client is doing interrupt() resulting in "java.nio.channels.ClosedByInterruptException" below? *** 2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null 2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - Connecting to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020 2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - closing ipc connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020: null java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257) at $Proxy41.getClosestRowBefore(Unknown Source) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:719) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectio
-
RE: Query regarding HTable.get and timeoutsSrikanth P. Shreenivas 2011-08-22, 13:57
Hi Ram,
Thanks for your help. We seemed to resolve the issue today by doing application changes. (Refer my other reply to this thread). Regards, Srikanth -----Original Message----- From: Ramkrishna S Vasudevan [mailto:[EMAIL PROTECTED]] Sent: Monday, August 22, 2011 11:56 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Srikanth I went through your logs.. not much info is present in that. Could you give the logs which shows what happened when the master and RS started. and also to whom the ROOT and META table got assigned? Regards Ram -----Original Message----- From: Srikanth P. Shreenivas [mailto:[EMAIL PROTECTED]] Sent: Saturday, August 20, 2011 6:27 PM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Further in this investigation, we enabled the debug logs on client side. We are observing that client is trying to root region, and is continuously failing to do so. The logs are filled with entries like this: 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImpl ementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null Client keeps retrying and retries get exhausted. Complete logs are available here: https://gist.github.com/1159064 including logs of master, zookeeper and region servers. If you can please look at the logs and provide some inputs on this issue, then it will be really helpful. We are really not sure why client is failing to get root regions from the server. Any guidance will be greatly appreciated. Thanks a lot, Srikanth -----Original Message----- From: Srikanth P. Shreenivas Sent: Saturday, August 20, 2011 1:57 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts I did some tests today. In our QA setup, we dont see any issues. I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected. However, in our production setup, I regularly see the issue wherein client thread gets interrupted because HTable.get() call does not return. It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121. However, I am not able to figure out what is causing this. Logs of Hbase as well as Hadoop servers seems to be quite normal. The cluster on which we are seeing this issue has no writes happening, and I do see this issue after about 10 operations. One strange thing I noticed though is that zookeeper logs were truncated and had only entries for last 15-20 minutes instead of complete day. This was around 8PM, so it had not rolled over. I had earlier too asked a query on same issue (http://www.mail-archive.com/[EMAIL PROTECTED]/msg09904.html). The changes we did of moving to CDH3 and using -XX:UseMemBar fixed issues in our QA setup. However, same changes in production does not seem to have similar effect. If you can provide any clues on how should we go about investigating this issue, that will be real help. Regards, Srikanth. ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:39 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts My apologies, I may not be reading the code right. You are right, it is GridGain timeout that is making the line 1255 to execute. However, the question is what would make a HTable.get() to take close to 10 minutes to induce a timeout in GridGain task. The value of numRetries at line 1236 should be 10 (default) and if we go with default value of HConstants.RETRY_BACKOFF, then, sleep time added with all retries will be only 61 seconds, and not close to 600 seconds as the case in our code is. Regards, Srikanth ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:21 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Please note that line numbers I am referencing are from the file : https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/h base/client/HConnectionManager.java ________________________________________ From: Srikanth P. Shreenivas Sent: Friday, August 19, 2011 12:19 AM To: [EMAIL PROTECTED] Subject: RE: Query regarding HTable.get and timeouts Hi Stack, Thanks a lot for your reply. It's always a comforting feeling to see very active community and especially your prompt replies to the queries. Yes, I am running it in as GridGain task, so it runs it GridGain's thread pool. In this case, we can imaging GridGain as something that hands off works to various worker threads and waits asynhronously for it complete. I have 10 minute timeout after which GridGain would consider work as timed out. What we are observing is that our tasks are timeing out at 10 minute boundary, and delay seems to be caused by the part of the work which is doing HTable.get. My suspicion is that Line 1255 in HConnectionManager.java is calling the Thread.currentThread().interrupt(), due to which the GridGain thread kind of stops doing what it was meant to do, and never responsds to master node resulting in timeout in master. In order for line 1255 to execute, we will have to assume that all retries were exhausted. Hence, my query that what would cause a HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries gets to line 1255. Regards, Srikanth ________________________________________ From: saint. |