HBase, mail # user - Re: Maximizing throughput


Re: Maximizing throughput
Bryan Keller 2013-01-15, 17:28
I'll follow up on this in case it is useful to anyone. It turns out I was network I/O limited. The switch I was using was in managed mode, which capped throughput at 1 Gbit/sec within the switch itself, not just on the wire. So with replication set to 2, throughput was about half of the theoretical max on a given box (client -> switch -> datanode 1 -> switch -> datanode 2). That was an eye-opener. I will probably move to a 10 Gbit/sec switch and/or use bonded NICs.
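
A rough sketch of the arithmetic behind that halving, for anyone checking their own setup. It assumes the switch has one shared throughput budget and that every replica costs one trip through the switch (the client is not running on a datanode). The function name and the simple division model are illustrative assumptions, not something taken from HDFS or from the switch documentation.

```python
def effective_write_throughput(switch_capacity_mbit, replication):
    """Estimate client-visible write throughput in Mbit/sec.

    Assumption: all hops in the write pipeline
    (client -> switch -> datanode 1 -> switch -> datanode 2 ...)
    share a single switch-wide budget, so each replica uses one
    full share of it.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return switch_capacity_mbit / replication


# The case described above: 1 Gbit/sec inside the switch, replication 2.
mbit = effective_write_throughput(1000, 2)
print(mbit, "Mbit/sec")      # 500.0 Mbit/sec
print(mbit / 8, "MB/sec")    # 62.5 MB/sec
```

Under the same assumptions, a 10 Gbit/sec switch with replication 2 would leave roughly 5 Gbit/sec for the client, at which point the disks are more likely to become the limit.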

On Jan 11, 2013, at 9:37 AM, Bryan Keller <[EMAIL PROTECTED]> wrote:

> Thanks for the responses. I'm running HBase 0.92.1 (Cloudera CDH4).
>
> The program is very simple: it inserts batches of rows into a table via multiple threads. I've tried running it with different parameters (column count, threads, batch size, etc.), but throughput didn't improve. I've pasted the code here: http://pastebin.com/gPXfdkPy
>
> I have auto flush on (the default) because I am inserting rows in batches, so I don't need to use the internal HTable write buffer.
>
> I've posted my config as well: http://pastebin.com/LVG9h6Z4
>
> The regionservers have 12 cores (24 with HT), 128 GB RAM, and 6 SCSI drives. Max throughput is 90-100 MB/sec on a drive. I've also tested this on an EC2 High I/O instance type with 2 SSDs, 64 GB RAM, and 8 cores (16 with HT). Both the EC2 and my colo cluster have the same issue of seemingly underutilizing resources.
>
> I measure disk usage using iostat and measured the theoretical max using hdparm and dd. I use iftop to monitor network bandwidth usage, and used iperf to test the theoretical max. For CPU usage I use top and iostat.
>
> The maximum write performance I'm getting is usually around 20 MB/sec on a drive (this is my colo cluster) on each of the 2 data nodes. That's about 20% of the max, and it is only sporadic, not a consistent 20 MB/sec per drive. Network usage also seems to top out around 20% (200 Mbit/sec) to each node. CPU usage on each node is around 10%. The problem is more pronounced on EC2, which has much higher theoretical limits for storage and network I/O.
>
> Copying a 133 GB file to HDFS looks like it gives similar performance to HBase (sporadic disk usage topping out at 20%, low CPU, 30-40% network I/O), so it seems this is more of an HDFS issue than an HBase issue.
>
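
The utilization percentages quoted above can be reproduced with a small helper like the one below. This is only a sketch: the function name is made up for illustration, the inputs are the figures from the message, and the 1000 Mbit/sec link speed is an assumption inferred from "20% (200 Mbit/sec)".

```python
def utilization(measured, theoretical_max):
    """Return measured throughput as a percentage of the theoretical max."""
    if theoretical_max <= 0:
        raise ValueError("theoretical_max must be positive")
    return 100.0 * measured / theoretical_max


# Disk: about 20 MB/sec observed against a measured max of 90-100 MB/sec.
disk_low = utilization(20, 100)
disk_high = utilization(20, 90)

# Network: about 200 Mbit/sec observed, assuming a 1000 Mbit/sec link.
network = utilization(200, 1000)

print(f"disk: {disk_low:.0f}-{disk_high:.0f}%, network: {network:.0f}%")
```

Comparing all resources on the same percentage scale makes it easier to see that none of them looked saturated from the host side.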