Re: HBase wire compatibility
On Thu, Feb 16, 2012 at 3:55 PM, Jeff Whiting <[EMAIL PROTECTED]> wrote:
> It seems like the only heavy part of the client would be the zookeeper
> interactions (forgive my ignorance if I'm wrong).

ZooKeeper interactions are extremely simple for a client; that's not
where the heavy part is.  All a client needs to do with ZooKeeper is
to find where the -ROOT- region is, period.  In the client I wrote,
asynchbase, I don't even maintain an open connection to ZooKeeper,
because 99.99% of the time it's unnecessary.
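
A minimal sketch of that lookup, in Java with the plain ZooKeeper client (the
/hbase/root-region-server znode is the 0.90-era default and the payload format
varies across HBase versions, so treat the details as illustrative):

  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooKeeper;

  // All the client needs from ZooKeeper: the address of the server hosting -ROOT-.
  public final class RootLocator {
    public static String findRootRegionServer(final String quorum) throws Exception {
      final ZooKeeper zk = new ZooKeeper(quorum, 5000, new Watcher() {
        public void process(final WatchedEvent event) { /* ignore session events */ }
      });
      try {
        final byte[] data = zk.getData("/hbase/root-region-server", false, null);
        return new String(data, "UTF-8");  // typically the host and port of the RegionServer
      } finally {
        zk.close();  // no reason to keep the session open once we have the answer
      }
    }
  }

Once that answer is cached, the session can be dropped entirely, which is why
asynchbase doesn't keep a ZooKeeper connection around.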

> Other than zookeeper, only
> a basic understanding of regions is needed.  So if the zookeeper
> interactions could be removed and pushed somewhere else in the stack, that
> could make the client much thinner.

Using line count (per "wc -l") as a rough approximation of code
complexity, here's a breakdown of asynchbase.  For a total of 11k
lines, the big chunks of code are:
ZooKeeper code: 360 lines (not actually big but I included it for comparison)
Code for handling NoSuchRegionException: 500 lines
Helper code to deal with byte arrays: 500 lines
Helper code to deal with HBase RPC serialization: 700 lines
Code to batch RPCs: 800 lines
Low-level socket code, and wire serialization/deserialization: 800 lines
Code to open, manage, close scanners: 1000 lines
Code for looking up and caching regions: 1000 lines

> hopefully never again.  IMHO since you are redoing the communication, why not
> improve the protocol to allow for a leaner client?  A leaner client
> would be more likely to work across major hbase changes, would be easier to
> maintain, would hide implementation details and could have fewer
> dependencies.

Yes, a leaner client would be better.  But the reason the client is fat
is that Bigtable's design pushed a lot of logic down to the clients
in order to be able to make RPC routing decisions there, and relieve
the tablet servers from having to do it.  When you start to have tens
of thousands of clients talking to a cluster, like Google does, it
makes sense to push this work down to the many clients, rather than
have the fewer TabletServers do it and re-route packets (adding extra
hops etc).  The overall system is more efficient this way.
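
To make that concrete, here's a rough sketch of the client-side routing being
described (the names, like RegionLocation and serverFor, are made up for
illustration and aren't asynchbase's or HBase's actual API):

  import java.util.Comparator;
  import java.util.Map;
  import java.util.concurrent.ConcurrentSkipListMap;

  final class ClientSideRouter {

    // Unsigned lexicographic ordering of row/start keys.
    private static final Comparator<byte[]> KEY_ORDER = new Comparator<byte[]>() {
      public int compare(final byte[] a, final byte[] b) {
        final int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
          final int d = (a[i] & 0xFF) - (b[i] & 0xFF);
          if (d != 0) {
            return d;
          }
        }
        return a.length - b.length;
      }
    };

    // Cached location of one region: its start key and the server hosting it.
    private static final class RegionLocation {
      final byte[] startKey;
      final String server;  // "host:port" of the RegionServer
      RegionLocation(final byte[] startKey, final String server) {
        this.startKey = startKey;
        this.server = server;
      }
    }

    // Regions sorted by start key, so a floor lookup finds the one that may contain a row.
    private final ConcurrentSkipListMap<byte[], RegionLocation> regionCache =
        new ConcurrentSkipListMap<byte[], RegionLocation>(KEY_ORDER);

    // The client, not the server, decides which RegionServer gets the RPC for this row.
    String serverFor(final byte[] rowKey) {
      final Map.Entry<byte[], RegionLocation> entry = regionCache.floorEntry(rowKey);
      if (entry != null) {
        return entry.getValue().server;  // cache hit: send the RPC straight there, no extra hop
      }
      // Cache miss: look the region up in .META. (itself found via -ROOT-),
      // cache the answer, then retry.  That lookup isn't shown here.
      throw new UnsupportedOperationException("META lookup not sketched");
    }
  }

All of that bookkeeping, plus invalidating the cache when a region moves or
splits, is exactly the kind of logic that makes the client fat.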

Leaner clients are better, but unfortunately lean clients are often
dumb, so it's hard to find a good tradeoff between simplicity and
efficiency.

> One of the reasons the client doesn't do well across major
> changes is how heavy it is.  Even if the client is never
> implemented in another language, a thinner client would seem to be an
> improvement.

Having maintained an HBase client written from scratch for about 2
years now, I can tell you that the only things I had to fix across
HBase releases were wire-level serialization breakages.  The heavy
logic of the client has remained mostly unchanged since the days of
HBase 0.20.
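
To illustrate the kind of breakage I mean (this is not the real HBase RPC
format, just a toy example): a from-scratch client hard-codes how every field
is laid out on the wire, so a change in field order, encoding or version
number between releases breaks it even when the high-level logic is untouched.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  // Toy RPC header: each field is written in a fixed order with a fixed encoding.
  final class RpcHeaderSketch {
    static byte[] encodeHeader(final byte rpcVersion, final String method) throws IOException {
      final ByteArrayOutputStream buf = new ByteArrayOutputStream();
      final DataOutputStream out = new DataOutputStream(buf);
      out.writeByte(rpcVersion);  // a bumped version constant across releases is a "breakage"
      out.writeUTF(method);       // length-prefixed method name
      out.flush();
      return buf.toByteArray();
    }
  }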

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com