-
REST servers locked up on single RS malfunction. (Jack Levin, 2011-04-21, 05:11)
Hello, with HBase 0.89 we see the following: all REST servers get locked up trying to connect to one of our RS servers. The error in the .out file on that region server looks like this:

Exception in thread "pool-1-thread-3" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:120)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:959)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:927)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:503)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:297)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

Question is, how come the region server did not die after this but just hogged the REST connections? And what does pool-1-thread-3 actually do?

-Jack
-
Re: REST servers locked up on single RS malfunction. (Stack, 2011-04-21, 07:13)
This looks like a bug. Elsewhere in the RPC you can register a handler for OOME explicitly, and we have a callback up into the regionserver where we set the server to abort or stop, depending on the type of OOME we see. In this case it looks like on OOME we just throw, and then all the executors fill up, so no more executors are available to process requests. (This is my current assessment; it could be a different one by morning.)

The root cause would look to be a big put. Could that be the case?

On the naming, that looks to be the default naming of executor threads done by the hosting ExecutorService.

St.Ack
-
Re: REST servers locked up on single RS malfunction. (Jack Levin, 2011-04-21, 07:47)
Shouldn't the RS just shut down then? It stays half alive and none of the puts succeed. Also, the OOME happened right after a flush/compaction/split, so clearly the RS was busy, and perhaps it was just a matter of hitting the heap ceiling.

-Jack
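One way to get the fail-fast behavior Jack asks for, in the spirit of the OOME callback Stack mentions, is to give every pool thread an uncaught-exception handler that escalates OutOfMemoryError to a server-wide abort. This is a minimal sketch, not HBase's actual mechanism: `AbortCallback` is a hypothetical interface, and the demo only counts down a latch where a real server would shut itself down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class FailFastOnOome {
    /** Hypothetical hook standing in for "abort this region server". */
    interface AbortCallback {
        void abort(String why, Throwable cause);
    }

    /** Wraps the default factory so an OOME on any pool thread reaches the callback. */
    static ThreadFactory failFastFactory(AbortCallback callback) {
        ThreadFactory base = Executors.defaultThreadFactory();
        return runnable -> {
            Thread t = base.newThread(runnable);
            t.setUncaughtExceptionHandler((thread, e) -> {
                if (e instanceof OutOfMemoryError) {
                    // Escalate to the whole server instead of silently losing one thread.
                    callback.abort("OOME in " + thread.getName(), e);
                } else {
                    System.err.println("Uncaught in " + thread.getName() + ": " + e);
                }
            });
            return t;
        };
    }

    /** Returns true if a simulated OOME on a pool thread triggered the callback. */
    static boolean runDemo() throws InterruptedException {
        CountDownLatch abortCalled = new CountDownLatch(1);
        // A real server would log here, then stop itself or halt the JVM.
        AbortCallback callback = (why, cause) -> {
            System.err.println("ABORTING: " + why);
            abortCalled.countDown();
        };
        ExecutorService pool = Executors.newFixedThreadPool(2, failFastFactory(callback));
        try {
            pool.execute(() -> { throw new OutOfMemoryError("simulated"); });
            return abortCalled.await(5, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("abort callback invoked: " + runDemo());
    }
}
```

A handler like this still runs on a heap that has just been exhausted, so it can fail too. HotSpot's `-XX:OnOutOfMemoryError` option, which runs a command when an OutOfMemoryError is first thrown, is a JVM-level alternative that does not depend on application code.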
-
Re: REST servers locked up on single RS malfunction. (Jack Levin, 2011-04-25, 17:32)
Stack:
Exception in thread "pool-1-thread-9" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:120)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:959)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:927)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:503)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:297)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

Btw, is this a put or a read? Perhaps we are crashing on some sort of large read?

-Jack
-
Re: REST servers locked up on single RS malfunction. (Jean-Daniel Cryans, 2011-04-25, 17:37)
Can't tell what it was, because it OOME'd while reading whatever was coming in.

Did you bump the number of handlers in that cluster too? You might be hitting what we talked about in this JIRA:
https://issues.apache.org/jira/browse/HBASE-3813

"Chatting w/ J-D this morning, he asked if the queues hold 'data'. The queues hold 'Calls'. Calls are the client request. They contain data. Jack had 2500 items queued. If each item to insert was 1MB, thats 25k * 1MB of memory that is outside of our generally accounting."

So the higher the number of handlers, the more memory the queues could use.

J-D
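J-D's point is that queued Calls carry their payloads, so the worst-case memory held by the queues grows with queue capacity, and capacity grows with the handler count. The back-of-envelope sketch below assumes capacity scales linearly with handlers and uses an illustrative 100 queue slots per handler; that constant is an assumption, not a value taken from this thread, so check the RPC server in your HBase version. The 1 MB call size follows the JIRA example.

```java
public class CallQueueFootprint {
    // ASSUMPTION: queue capacity = handlers * slots-per-handler.
    // 100 is illustrative only; verify against your RPC server's code/config.
    static final int ASSUMED_SLOTS_PER_HANDLER = 100;

    /** Worst-case bytes held by queued calls if every slot holds a call of callBytes. */
    static long worstCaseQueuedBytes(int handlerCount, long callBytes) {
        return (long) handlerCount * ASSUMED_SLOTS_PER_HANDLER * callBytes;
    }

    public static void main(String[] args) {
        long oneMb = 1024L * 1024L;
        for (int handlers : new int[] {10, 25, 100}) {
            long bytes = worstCaseQueuedBytes(handlers, oneMb);
            System.out.printf("%3d handlers -> up to %d MB of queued call data%n",
                    handlers, bytes / oneMb);
        }
    }
}
```

Under these assumptions, 25 handlers already allow up to 2,500 MB of queued call data.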
-
Re: REST servers locked up on single RS malfunction. (Jack Levin, 2011-04-25, 20:04)
That's a separate cluster; it's barely getting any traffic, so I don't think the queue would be an issue. We do, however, have very large files stored (one file per row). So the question is: if it's a GET that breaks things, how can we avoid it?

-Jack
-
Re: REST servers locked up on single RS malfunction. (Jean-Daniel Cryans, 2011-04-25, 20:15)
There's a good chance that if the region server started getting slow, the requests from the REST servers would start piling up in the queues and finally blow out the memory. You could confirm that by looking at the GC logs before the OOME. Also, when it died, it should have dumped an hprof file. If you have that file (it should be a few GBs), it would be possible to tell what was using all that space.

It would also be interesting to see what happened in the logs before that, including the metrics dump; that might give us a clue. Once we have a better understanding of what happened, we can look into finding the right solution.

J-D
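J-D's pile-up theory can be put in rough numbers: once the region server serves calls more slowly than the REST servers send them, the backlog, and the heap it occupies, grows linearly until the headroom is gone. The fluid-model sketch below ignores GC, partial reads, and client retries, and every rate and size in `main` is hypothetical.

```java
public class BacklogGrowth {
    /** Bytes per second added to the call queues when arrivals outpace service. */
    static double growthBytesPerSec(double arrivalsPerSec, double servedPerSec, double callBytes) {
        return Math.max(0.0, arrivalsPerSec - servedPerSec) * callBytes;
    }

    /** Seconds until queued calls consume the given heap headroom (infinite if no growth). */
    static double secondsToExhaust(double headroomBytes, double arrivalsPerSec,
                                   double servedPerSec, double callBytes) {
        double growth = growthBytesPerSec(arrivalsPerSec, servedPerSec, callBytes);
        return growth == 0.0 ? Double.POSITIVE_INFINITY : headroomBytes / growth;
    }

    public static void main(String[] args) {
        double mb = 1024.0 * 1024.0;
        // Hypothetical: 50 calls/s arrive, a busy RS serves only 40/s,
        // each call carries 1 MB, and 1 GB of heap is free for queued calls.
        double secs = secondsToExhaust(1024 * mb, 50, 40, mb);
        System.out.printf("heap headroom gone after ~%.1f s%n", secs);
    }
}
```

With those inputs the queues grow by 10 MB/s and the 1 GB of headroom lasts about 102 seconds, which suggests that even a short slowdown, such as the flush/compaction/split Jack mentions, could plausibly be enough.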