Re: HBase Thrift inserts bottlenecked somewhere -- but where?
Andrew Purtell 2013-03-05, 07:04
> We're doing about 20,000 writes per second sustained across 4 tables and
> 6 CFs. Does this sound ballpark right for 6x EC2 m1.xlarges?

c1.xlarge will provide better performance.

m1.xlarge has pretty poor IO performance overall relative to real hardware.

To be fair, so does c1.xlarge, but m1.xlarge is worse still.

AWS is a big red flag in any discussion about performance. Systems like
Hadoop and HBase are ultimately IO intensive. When AWS built DynamoDB they
created the "high IO" instance type to host it. If you ran HBase on that it
would perform pretty well too, but you'd perhaps pale at the monthly cost
relative to an m1.xlarge.

Other suggestions on this thread may help, but ultimately the AWS platform
is the limiting factor unless you move to a "high IO" or "cluster compute"
instance type. Ironically, dedicated server hosting options like SoftLayer
are far less expensive than AWS rates at 24/7 utilization.
On Sun, Mar 3, 2013 at 1:12 AM, Dan Crosta <[EMAIL PROTECTED]> wrote:

> On Mar 1, 2013, at 10:42 PM, lars hofhansl wrote:
> > What performance profile do you expect?
>
> That's a good question. Our configuration is actually already exceeding
> our minimum and desired performance thresholds, so I'm not too worried
> about it. My concern is more with developing an understanding of where the
> bottlenecks are (the cluster doesn't appear to be disk-, CPU-, or
> network-bound at the moment) and building an intuition for working with
> HBase in case we're ever under the gun.
>
>
> > Where does it top out (i.e. how many ops/sec)?
>
> We're doing about 20,000 writes per second sustained across 4 tables and 6
> CFs. Does this sound ballpark right for 6x EC2 m1.xlarges?
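>
> For concreteness, here is a minimal sketch of what batched inserts through
> the Thrift gateway might look like, using the happybase wrapper. This is
> illustrative only -- the host, table, and column names are made up, and it
> is not our actual client code:
>
>     import happybase
>
>     # Connect to the HBase Thrift gateway (default Thrift port is 9090).
>     connection = happybase.Connection('thrift-gateway-host', port=9090)
>     table = connection.table('events')
>
>     # Batch puts client-side so each Thrift round trip carries many rows.
>     with table.batch(batch_size=1000) as batch:
>         for i in range(10000):
>             batch.put('row-%d' % i, {'d:payload': 'value-%d' % i})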
>
>
> > Also note that each data item is replicated to three nodes (by HDFS). So
> > in a 6 machine cluster each machine would get 50% of the writes.
> > If you are looking for performance you really need a larger cluster to
> > amortize this replication cost across more machines.
>
> That's only true from the HDFS perspective, right? Any given region is
> "owned" by 1 of the 6 regionservers at any given time, and writes are
> buffered in memory before being persisted to HDFS, right?
>
> In any event, there doesn't seem to be any disk contention to speak of --
> we average around 10% disk utilization at this level of load (each machine
> has 4 spindles of local storage, and we are not using EBS).
>
> One setting no one has mentioned yet is the DataNode handler count
> (dfs.datanode.handler.count), which is currently set to its default of 3.
> Should we experiment with raising that?
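>
> For reference, raising it would just be a property change in hdfs-site.xml
> on each DataNode, followed by a DataNode restart -- the value below is only
> an example, not a recommendation:
>
>     <property>
>       <name>dfs.datanode.handler.count</name>
>       <!-- illustrative value; the default is 3 -->
>       <value>8</value>
>     </property>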
>
>
> > The other issue to watch out for is whether your keys are generated such
> > that a single regionserver is hot spotted (you can look at the operation
> > count on the master page).
>
> All of our keys are hashes or UUIDs, so the key distribution is very
> smooth, and this is confirmed by the "Region Servers" table on the master
> node's web UI.
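>
> To illustrate the scheme (a simplified sketch, not our production code --
> the names and the 8-character prefix length are just for illustration),
> the row keys look roughly like:
>
>     import hashlib
>
>     def row_key(natural_key):
>         # Prefix the natural key with part of its MD5 digest so rows spread
>         # evenly across regions instead of hot-spotting one region server.
>         digest = hashlib.md5(natural_key.encode('utf-8')).hexdigest()
>         return '%s:%s' % (digest[:8], natural_key)
>
>     # Records with no natural key just get a random UUID:
>     # import uuid; row_key = uuid.uuid4().hex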
>
>
> Thanks,
> - Dan
--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)