HBase, mail # dev - commit semantics


Re: commit semantics
Jean-Daniel Cryans 2010-01-12, 21:36
Even with 100 regions per server, times 1000 region servers, we are talking
about potentially 100,000 open files instead of 1,000 (and we also
have to count every replica).
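
To make the arithmetic above concrete, here is a rough sketch of the count. The function name and the replication factor of 3 are assumptions for illustration only (3 is the usual HDFS default, but it is not stated in this thread), and the model simply counts one open commit log per region or per region server, multiplied by the number of replicas.

```python
def open_commit_logs(region_servers, regions_per_server, per_region_logs,
                     replication=3):
    """Estimate open commit log files cluster-wide.

    Hypothetical helper: `replication` defaults to 3 as an assumption,
    not a figure taken from the thread.
    Returns (logical open files, open files counting every replica).
    """
    # One log per region, or one shared log per region server.
    logs_per_server = regions_per_server if per_region_logs else 1
    logical = region_servers * logs_per_server
    return logical, logical * replication


# One commit log per region server vs. one per region.
shared = open_commit_logs(1000, 100, per_region_logs=False)
per_region = open_commit_logs(1000, 100, per_region_logs=True)
print("shared log:     ", shared)
print("per-region logs:", per_region)
```

This ignores store files and anything else a region server keeps open, so it is only a lower bound on what the OS and the DataNodes would have to sustain.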

I guess that an OS that was configured for such usage would be able to
sustain it... You would have to watch that metric cluster-wide, get
new nodes when needed, etc.

Then you need to make sure that GC pauses don't block for too long, to
keep the unavailability window very short.

J-D

On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]> wrote:
>> I presume you intend to run HBase region servers
>> colocated with HDFS DataNodes.
>
> Yes.
>
> ---
>
> Seems like we all generally agree that a large number of regions per region server may not be the way to go.
>
> So, coming back to Dhruba's question on having one commit log per region instead of one commit log per region server: is the number of open HDFS files still a major concern?
>
> Is my understanding correct that the unavailability window during region server failover is large because of the time it takes to split the shared commit log into per-region logs? If we instead always had per-region commit logs, even in the normal mode of operation, would the unavailability window be minimized? It does limit the extent of batch/group commits you can do, though, since you can only batch updates going to the same region. Any other gotchas/issues?
>
> regards,
> Kannan
> -----Original Message-----
> From: Andrew Purtell [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, January 12, 2010 12:50 PM
> To: hbase[EMAIL PROTECTED]
> Subject: Re: commit semantics
>
>> But would, say, having a
>> smaller number of regions per region server (say ~50) be really bad?
>
> Not at all.
>
> There are some (test) HBase deployments I know of that go pretty
> vertical, multiple TBs of disk on each node therefore wanting a high
> number of regions per region server to match that density. That may meet
> with operational success but it is architecturally suspect. I ran a test
> cluster once with > 1,000 regions per server on 25 servers, in the 0.19
> timeframe. 0.20 is much better in terms of resource demand (less) and
> liveness (enormously improved), but I still wouldn't recommend it,
> unless your clients can wait for up to several minutes on blocked reads
> and writes to affected regions should a node go down. With that many
> regions per server, it stands to reason just about every client would be
> affected.
>
> The numbers I have for Google's canonical BigTable deployment are several
> years out of date but they go pretty far in the other direction -- about
> 100 regions per server is the target.
>
> I think it also depends on whether you intend to colocate TaskTrackers
> with the region servers. I presume you intend to run HBase region servers
> colocated with HDFS DataNodes. After you have an HBase cluster up for some
> number of hours, certainly ~24, background compaction will bring the HDFS
> blocks backing region data local to the server, generally. MapReduce
> tasks backed by HBase tables will see similar advantages of data locality
> that you are probably accustomed to when working with files in HDFS. If
> you mix storage and computation this way it makes sense to seek a balance
> between the amount of data stored on each node (number of regions being
> served) and the available computational resources (available CPU cores,
> time constraints (if any) on task execution).
>
> Even if you don't intend to do the above, it's possible that an overly
> high region density can negatively impact performance if too much I/O
> load is placed on average on each region server. Adding more servers to
> spread load would then likely help**.
>
> These considerations bias against hosting a very large number of regions
> per region server.
>
>   - Andy
>
> **: I say likely because this presumes query and edit patterns have been
> guided as necessary through engineering to be widely distributed in the