RE: perplexing HBase bug: looking for where to learn how to debug
The first step in debugging HBase is usually going through the Master and RegionServer logs.  It can sometimes be more art than science, but the majority of our debugging is done through log analysis.

If you can find specific offending regions, you can parse through the logs looking for mentions of that region and see where things went wrong.
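A minimal sketch of that kind of log search, in Python. The region name is taken from the error quoted below; the helper name and the log file paths passed on the command line are placeholders for illustration, not anything HBase provides.

```python
import sys


def lines_mentioning(log_lines, region_name):
    """Yield (line_number, text) for every log line that mentions region_name."""
    for number, line in enumerate(log_lines, start=1):
        if region_name in line:
            yield number, line.rstrip("\n")


# Usage (paths are placeholders for your own Master/RegionServer logs):
#   python find_region.py 'usertable,,1294095537393' master.log regionserver.log
if __name__ == "__main__" and len(sys.argv) >= 3:
    region = sys.argv[1]
    for path in sys.argv[2:]:
        with open(path, errors="replace") as handle:
            for number, text in lines_mentioning(handle, region):
                print(f"{path}:{number}: {text}")
```

Reading the hits in timestamp order across the Master and every RegionServer usually shows where the region was last opened, closed, or assigned.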

If you're just getting started with HBase, I would also recommend working with the latest 0.90RC, as issues like the ones you're seeing have been fixed since 0.20.6.

JG

> -----Original Message-----
> From: Chet Murthy [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, January 05, 2011 10:38 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: perplexing HBase bug: looking for where to learn how to debug
>
>
> I've just started using hbase, and have encountered a perplexing bug.
> The bug occurs on one set of Linux boxes, and not on another set, even
> though they're both x86_64 Linux, and both are running -identical- JVM
> releases.
>
> I've attached a description of the problem below, but really, what I'm
> wondering is, if there's a description someplace of various places to turn on
> instrumentation in hbase, so I can figure out what's wrong.  I plan to do a lot
> of work with hbase in the future, so knowing how to debug it is in some
> sense more important than finding out the fix for this particular bug.
>
> I really am looking to learn how to fish here.  I'm sure I can slowly dig around
> and find all the various tracing facilities and such, but I figured there might be a
> cheat-sheet someplace ....
>
> Thanks,
> --chet--
>
> ==============================================================
> Basically, I set up hadoop 0.20.0 + hbase 0.20.6, in a cluster with 1 namenode,
> and anywhere from 2-5 datanodes which are also regionservers.  I'm running
> a single zookeeper node, since this is just for testing.  Furthermore, all these
> machines are isolated, high-performance, SMP, with lots of memory.
> Modern Intel/AMD boxes.
>
> The cluster which "works" runs Fedora 9 on Opteron, and the one that "fails"
> runs RHEL5 on Intel Xeon (something-or-other -- I forget).
>
> The test I'm running is the Yahoo! Cloud Serving Benchmark (YCSB).  I'm just trying to
> load 1m records, and on the cluster that fails, I get,
> variously:
>
> (a) a load will fail with an error like:
>
> com.yahoo.ycsb.DBException:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server  -- nothing found, no 'location' returned,
> tableName=usertable, reload=true -- for region , row 'user1000015788', but
> failed after 11 attempts.
> Exceptions:
> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
> address listed in .META. for region usertable,,1294095537393
> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
> address listed in .META. for region usertable,,1294095537393
>
> (b) a load will succeed, but there won't be 1m rows (where I use the "count"
> command in "hbase shell" to count).
>
> (c) sometimes, a "truncate" will fail, with an error of the form above.  The
> step which fails is the "disable" step.
>
> Java stack-dumps from the regionservers don't show any threads doing
> anything interesting.  I don't know how to interrogate Zookeeper; perhaps
> there's something messed-up in there ....
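
On interrogating ZooKeeper: one option is its "four-letter word" admin commands (for example `ruok`, `stat`, `dump`), which are sent as plain text to the client port. The following is a rough sketch in Python, not a definitive tool. The function name and the host name `zk-host` are placeholders, 2181 is ZooKeeper's default client port, and newer ZooKeeper releases may need these commands enabled in the server configuration before they respond.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send one ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example calls (host name is a placeholder for your ZooKeeper node):
#   four_letter_word("zk-host", 2181, "ruok")   # health check
#   four_letter_word("zk-host", 2181, "stat")   # server and client details
#   four_letter_word("zk-host", 2181, "dump")   # sessions and ephemeral nodes
```

A healthy server answers `ruok` with `imok`; the `dump` output is a reasonable place to check whether the regionservers still hold live sessions.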