HBase >> mail # user >> Re: Maximizing throughput


Asaf Mesika 2013-01-11, 13:03
Re: Maximizing throughput
Thanks for the responses. I'm running HBase 0.92.1 (Cloudera CDH4).

The program is very simple: it inserts batches of rows into a table from multiple threads. I've tried running it with different parameters (column count, thread count, batch size, etc.), but throughput didn't improve. I've pasted the code here: http://pastebin.com/gPXfdkPy
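For anyone who wants to reproduce the shape of the test without opening the pastebin, here is a minimal sketch of that kind of load generator. It is not the pastebin code: `write_batch` is a hypothetical callable standing in for whatever actually sends a batch to the table, and the row-key format, column names, and 100-byte value size are assumptions made up for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def make_batch(start, batch_size, column_count, value_size=100):
    """Build one batch of rows; each row is (row_key, {column: value})."""
    value = b"x" * value_size  # assumed payload size, not from the original test
    return [
        (f"row-{start + i:012d}".encode(),
         {f"c{j}".encode(): value for j in range(column_count)})
        for i in range(batch_size)
    ]


def batch_bytes(batch):
    """Approximate payload size of a batch: keys + column names + values."""
    return sum(len(key) + sum(len(c) + len(v) for c, v in cols.items())
               for key, cols in batch)


def run_load(write_batch, threads, batches_per_thread, batch_size, column_count):
    """Call write_batch from several threads; return payload throughput in MB/sec."""
    total = 0
    lock = threading.Lock()

    def worker(thread_id):
        nonlocal total
        sent = 0
        for b in range(batches_per_thread):
            start = (thread_id * batches_per_thread + b) * batch_size
            batch = make_batch(start, batch_size, column_count)
            write_batch(batch)  # hypothetical sink; replace with the real client call
            sent += batch_bytes(batch)
        with lock:
            total += sent

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(worker, range(threads)))  # list() surfaces worker exceptions
    elapsed = time.perf_counter() - started
    return total / elapsed / 1e6
```

Varying `threads`, `batch_size`, and `column_count` while watching the returned rate is the experiment described above. If the rate stays flat as the client pushes harder, the bottleneck is probably not in the client.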

I've left auto flush on (the default). Since I'm inserting rows in batches, I don't need the internal HTable write buffer.

I've posted my config as well: http://pastebin.com/LVG9h6Z4

The regionservers have 12 cores (24 with HT), 128 GB RAM, and 6 SCSI drives. Max throughput is 90-100 MB/sec per drive. I've also tested this on an EC2 High I/O instance type with 2 SSDs, 64 GB RAM, and 8 cores (16 with HT). Both the EC2 cluster and my colo cluster have the same issue of seemingly underutilizing resources.

I measure disk usage with iostat, and measured the theoretical max with hdparm and dd. I use iftop to monitor network bandwidth usage, and used iperf to test the theoretical max. For CPU usage I use top and iostat.

The maximum write performance I'm getting is usually around 20 MB/sec per drive (this is on my colo cluster) on each of the 2 data nodes. That's about 20% of the max, and it is only sporadic, not a consistent 20 MB/sec per drive. Network usage also seems to top out around 20% (200 Mbit/sec) to each node. CPU usage on each node is around 10%. The problem is more pronounced on EC2, which has much higher theoretical limits for storage and network I/O.
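To make the percentages above easy to check, here is the arithmetic as a small sketch. The inputs are the numbers quoted in this message; the link capacity is not a measured value, it is only what "20% at 200 Mbit/sec" implies.

```python
def utilization(observed, maximum):
    """Fraction of a theoretical maximum that is actually being used."""
    return observed / maximum


# Per-drive write rate: ~20 MB/sec peak against a measured 90-100 MB/sec max.
disk_util_low = utilization(20, 100)
disk_util_high = utilization(20, 90)

# Network: 200 Mbit/sec is described as about 20% of the max, so the implied
# link capacity is 200 / 0.20 (an inference from the post, not an iperf result).
implied_link_mbit = 200 / 0.20

print(f"disk utilization: {disk_util_low:.0%} to {disk_util_high:.0%}")
print(f"implied link capacity: {implied_link_mbit:.0f} Mbit/sec")
```

So the drives peak at roughly a fifth of what they can sustain, and the network figure is consistent with a 1 Gbit/sec link that is mostly idle.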

Copying a 133 GB file to HDFS gives similar performance to HBase (sporadic disk usage topping out at 20%, low CPU, 30-40% network I/O), so this seems to be more of an HDFS issue than an HBase issue.