Thanks for the analysis.
Do you mind opening a Jira?
On Jan 10, 2012, at 7:51 AM, Yves Langisch <[EMAIL PROTECTED]> wrote:
> Still happens with HBase 0.90.5/Hadoop 1.0.0, but I think I have some more insight into the problem. Here is an up-to-date stack trace:
> at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:986)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2008)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> Caused by: java.lang.NullPointerException
> at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:881)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.addRowLock(HRegionServer.java:2018)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.lockRow(HRegionServer.java:2004)
> ... 5 more
> After checking the source code, I noticed that the value about to be put into the ConcurrentHashMap can be null when the waitForLock flag is true or the rowLockWaitDuration has expired (HRegion#internalObtainRowLock, line 2111ff). I think the latter is what happens in our case, as we have heavy load hitting the server.
> IMHO this case should be handled somehow and must not lead to an NPE.
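To illustrate the failure mode described above: java.util.concurrent.ConcurrentHashMap rejects null values, so passing a null lock id straight to put() throws a NullPointerException. The following is only a minimal standalone sketch, not the HBase source. The class RowLockGuardSketch, both method names, and the exception message are hypothetical; the guard simply assumes a null lock id means the lock could not be obtained in time.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the region server's bookkeeping of row locks.
class RowLockGuardSketch {
    // Maps a generated lock name to the lock id handed back by the region.
    private final ConcurrentHashMap<String, Integer> rowLocks = new ConcurrentHashMap<>();
    private final AtomicLong nextName = new AtomicLong();

    // Mirrors the reported failure: a null lockId reaches put(), which throws an NPE.
    long addRowLockUnguarded(Integer lockId) {
        long name = nextName.incrementAndGet();
        rowLocks.put(Long.toString(name), lockId);
        return name;
    }

    // Assumed fix: treat a null lockId as "lock not obtained" and report it as an IOException.
    long addRowLockGuarded(Integer lockId) throws IOException {
        if (lockId == null) {
            throw new IOException("Could not obtain row lock (wait timed out?)");
        }
        long name = nextName.incrementAndGet();
        rowLocks.put(Long.toString(name), lockId);
        return name;
    }

    public static void main(String[] args) {
        RowLockGuardSketch sketch = new RowLockGuardSketch();
        try {
            sketch.addRowLockUnguarded(null);
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }
        try {
            sketch.addRowLockGuarded(null);
        } catch (IOException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

With a guard of this kind the client would see a descriptive IOException rather than an NPE wrapped by convertThrowableToIOE; whether an IOException is the right type to surface is a design choice for the Jira.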
> On Dec 30, 2011, at 12:12 PM, Yves Langisch wrote:
>> Still happens, but before I add any debugging information I'll try deploying the new version, 0.90.5.
>> On Dec 18, 2011, at 12:08 AM, Stack wrote:
>>> On Fri, Dec 16, 2011 at 8:20 AM, Yves Langisch <[EMAIL PROTECTED]> wrote:
>>>> I'm using the async hbase client (1.0) and there is no way to choose a lockId on my own:
>>>> return database.client().lockRow(
>>>> new RowLockRequest(TableManager.ID_TABLE_NAME, MAXID_ROW)).join();
>>>> Any ideas what else could be wrong here?
>>> Looking at the code on regionserver side,
>>> down around line 1994, it's unlikely the region is null since we should
>>> throw NotServingRegionException if we can't find the region (and we check
>>> for a null region name a few lines up), so maybe it's something in the way
>>> we do obtainRowLock on line 1995?
>>> Any chance of your instrumenting the regionserver? Adding a bit of
>>> debugging and deploying the debugging regionserver?
>>> My guess is we haven't seen this before because not many use rowlocks
>>> (rowlocks, if long-lived and with lots of contending clients, could freeze
>>> you out of the server; each client blocked waiting on rowlock to clear
>>> occupies a handler of which there are a bounded number).
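To illustrate the handler exhaustion described above, here is a minimal standalone sketch that uses plain java.util.concurrent classes rather than any HBase code. A fixed thread pool stands in for the bounded set of RPC handlers and a ReentrantLock stands in for a long-lived row lock; the class name, method name, and parameters are all hypothetical. Once every handler is blocked waiting on the lock, an unrelated request cannot start until a waiter times out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model: a fixed pool plays the role of the bounded RPC handlers.
class HandlerStarvationSketch {

    // Returns how long (in ms) an unrelated request waited for a free handler.
    static long unrelatedRequestDelayMillis(int handlers, long lockWaitMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        ReentrantLock rowLock = new ReentrantLock();
        rowLock.lock(); // a long-lived row lock held by some other client
        try {
            CountDownLatch allBusy = new CountDownLatch(handlers);
            for (int i = 0; i < handlers; i++) {
                pool.submit(() -> {
                    allBusy.countDown();
                    try {
                        // Each contending client occupies one handler while it waits.
                        if (rowLock.tryLock(lockWaitMillis, TimeUnit.MILLISECONDS)) {
                            rowLock.unlock();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allBusy.await(); // every handler is now taken by a lock waiter

            long start = System.nanoTime();
            Callable<Long> unrelated = () -> System.nanoTime();
            Future<Long> ranAt = pool.submit(unrelated); // queued behind the waiters
            return TimeUnit.NANOSECONDS.toMillis(ranAt.get() - start);
        } finally {
            rowLock.unlock();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long delay = unrelatedRequestDelayMillis(2, 300);
        System.out.println("unrelated request waited about " + delay + " ms");
    }
}
```

In this model the delay seen by the unrelated request tracks the lock wait timeout, which is why a bounded wait on the server side matters when many clients contend for the same row.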