HBase, mail # user - Replication not suited for intensive write applications?


Re: Replication not suited for intensive write applications?
lars hofhansl 2013-06-20, 22:47
I see.

In HBase you have machines for both CPU (to serve requests) and storage (to hold the data).

If you only grow your cluster for CPU and you keep all RegionServers 100% busy at all times, you are correct.

Maybe you need to increase replication.source.size.capacity and/or replication.source.nb.capacity (although I doubt that this will help here).

Also, a replication source will pick region servers from the target at random (10% of them by default). That has two effects:
1. Each source will pick exactly one RS at the target: ceil(3 * 0.1) = 1
2. With such a small cluster setup, the likelihood is high that two or more RSs in the source will happen to pick the same RS at the target, which leads to lower throughput.
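
To make the arithmetic behind these two points concrete, here is a minimal Python sketch. It is a simplified model of the numbers in this mail, not HBase code: the function names are made up for illustration, and it assumes each source RS picks its single target RS uniformly and independently.

```python
import math
from itertools import product


def sinks_per_source(num_target_rs: int, ratio: float) -> int:
    """Number of target RSs each source picks: ceil(num_target_rs * ratio)."""
    return math.ceil(num_target_rs * ratio)


def collision_probability(num_sources: int, num_targets: int) -> float:
    """Exact probability that at least two sources pick the same target.

    Assumption (hypothetical model): every source picks exactly one target,
    uniformly at random and independently of the other sources.
    """
    # Enumerate every possible assignment of one target to each source.
    outcomes = list(product(range(num_targets), repeat=num_sources))
    # An assignment is a collision if some target was picked more than once.
    collisions = sum(1 for picks in outcomes if len(set(picks)) < num_sources)
    return collisions / len(outcomes)


print(sinks_per_source(3, 0.1))               # 1
print(sinks_per_source(3, 1.0))               # 3
print(round(collision_probability(3, 3), 3))  # 0.778
```

Under these assumptions, with 3 source RSs and 3 target RSs, 21 of the 27 possible assignments have at least two sources sharing a target, so an uneven spread is the expected case rather than the exception.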

In fact your numbers might indicate that two of your source RSs might have picked the same target (you get 2/3 of your throughput via replication).
In any case, before drawing conclusions this should be tested with a larger cluster.
Maybe raise replication.source.ratio from 0.1 to 1 (so that each source RS round-robins over all target RSs, which should lead to better distribution), but that might have other side effects, too.

Did you measure the disk IO at each RS at the target? Maybe one of them is mostly idle.

-- Lars
________________________________
From: Asaf Mesika <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]>
Sent: Thursday, June 20, 2013 1:38 PM
Subject: Re: Replication not suited for intensive write applications?
Thanks for the answer!
My responses are inline.

On Thu, Jun 20, 2013 at 11:02 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> First off, this is a pretty constructed case leading to a specious general
> conclusion.
>
> If you only have three RSs/DNs and the default replication factor of 3,
> each machine will get every single write.
> That is the first issue. Using HBase makes little sense with such a small
> cluster.
>
You are correct. Nonetheless, the network, as I measured it, was far from its
capacity and thus probably not the bottleneck.

>
> Secondly, as you say yourself, there are only three regionservers writing
> to the replicated cluster using a single thread each in order to preserve
> ordering.
> With more region servers your scale will tip the other way. Again more
> regionservers will make this better.
>
I presume that, in production, I will add more region servers to accommodate
growing write demand on my cluster. Hence, my clients will write with more
threads. Thus, proportionally, I will always have a lot more client threads
than region servers (each has one replication thread). So I
don't see how adding more region servers will tip the scale to the other side.
The only way to avoid this is to design the cluster so that, if
I can handle the events received at the client (which writes them to HBase)
with x threads, then x is the number of region servers I should have. If I
have a spike, it will even out eventually, but this
underutilizes my cluster hardware, no?
> As for your other question, more threads can lead to better interleaving
> of CPU and IO, thus leading to better throughput (this relationship is not
> linear, though).
>
>

>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Asaf Mesika <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Cc:
> Sent: Thursday, June 20, 2013 3:46 AM
> Subject: Replication not suited for intensive write applications?
>
> Hi,
>
> I've been conducting lots of benchmarks to test the maximum throughput of
> replication in HBase.
>
> I've come to the conclusion that HBase replication is not suited for write
> intensive application. I hope that people here can show me where I'm wrong.
>
> *My setup*
> *Cluster* (Master and slave are alike)
> 1 Master, NameNode
> 3 RS, Data Node
>
> All computers are the same: 8 Cores x 3.4 GHz, 8 GB Ram, 1 Gigabit ethernet
> card
>
> I insert data into HBase from a java process (client) reading files from
> disk, running on the machine running the HBase Master in the master