HBase >> mail # dev >> commit semantics


Re: commit semantics
Even with 100 regions times 1,000 region servers, we're talking about
potentially having 100,000 open files instead of 1,000 (and we also have
to count every replica).
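
As a back-of-the-envelope check on those numbers, a small sketch; the 3-way replication factor is an assumption (the HDFS default), not something stated in the thread:

    // Rough math for commit-log file counts, per-region-server vs. per-region.
    public class WalFileCount {
        public static void main(String[] args) {
            int regionServers = 1000;
            int regionsPerServer = 100;
            int replication = 3; // assumed HDFS default, not stated in the thread

            long perServerLogs = regionServers;                            // one log per region server
            long perRegionLogs = (long) regionServers * regionsPerServer;  // one log per region

            System.out.printf("per-RS logs:     %,d files (%,d replicas)%n",
                    perServerLogs, perServerLogs * replication);
            System.out.printf("per-region logs: %,d files (%,d replicas)%n",
                    perRegionLogs, perRegionLogs * replication);
            // => 1,000 vs. 100,000 open files; 3,000 vs. 300,000 block replicas to track.
        }
    }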

I guess an OS configured for that kind of usage would be able to
sustain it... You would have to watch that metric cluster-wide, add
new nodes when needed, etc.
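
For watching that metric on a single node, a minimal sketch, assuming a Sun/Oracle-style JVM on a Unix-like OS; a real deployment would export the value to cluster-wide monitoring rather than print it:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Prints this process's open file descriptor count against the OS limit.
    public class FdWatch {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                long open = unix.getOpenFileDescriptorCount();
                long max = unix.getMaxFileDescriptorCount();
                System.out.printf("open fds: %d / %d (%.1f%%)%n",
                        open, max, 100.0 * open / max);
            } else {
                System.out.println("fd counts not available on this platform");
            }
        }
    }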

Then, to keep the unavailability window very low, you also need to make
sure that GC pauses don't block for too long.

J-D

On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]> wrote:
>> I presume you intend to run HBase region servers
>> colocated with HDFS DataNodes.
>
> Yes.
>
> ---
>
> Seems like we all generally agree that a large number of regions per region server may not be the way to go.
>
> So, coming back to Dhruba's question about having one commit log per region instead of one commit log per region server: is the number of open HDFS files still a major concern?
>
> Is my understanding correct that the unavailability window during region server failover is large because of the time it takes to split the shared commit log into per-region logs? If instead we always had per-region commit logs, even in the normal mode of operation, would the unavailability window be minimized? It does limit the extent of batch/group commits you can do, though, since you can only batch updates going to the same region. Any other gotchas/issues?
>
> regards,
> Kannan
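
A toy model (not HBase code) of the group-commit trade-off in the question above: with one shared commit log per region server, a whole batch of edits can share a single sync, while with per-region commit logs the batch pays one sync per distinct region it touches:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Each edit is represented only by the region it belongs to, and every
    // log flush is assumed to cost one HDFS sync.
    public class GroupCommitSketch {

        // One shared commit log per region server: edits from many regions
        // can be appended together and flushed with a single sync.
        static int syncsWithSharedLog(List<String> regionsTouched) {
            return regionsTouched.isEmpty() ? 0 : 1;
        }

        // One commit log per region: only edits for the same region can share
        // a flush, so the batch costs one sync per distinct region touched.
        static int syncsWithPerRegionLogs(List<String> regionsTouched) {
            Set<String> distinct = new HashSet<String>(regionsTouched);
            return distinct.size();
        }

        public static void main(String[] args) {
            List<String> batch = Arrays.asList("region-A", "region-B", "region-A", "region-C");
            System.out.println("shared log syncs:     " + syncsWithSharedLog(batch));     // 1
            System.out.println("per-region log syncs: " + syncsWithPerRegionLogs(batch)); // 3
        }
    }

The failover side of the trade-off runs the other way: the shared log has to be split per region before the affected regions can be reopened, which is where the unavailability window mentioned above comes from.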
> -----Original Message-----
> From: Andrew Purtell [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, January 12, 2010 12:50 PM
> To: hbase[EMAIL PROTECTED]
> Subject: Re: commit semantics
>
>> But would, say, having a smaller number of regions per region server
>> (say ~50) be really bad?
>
> Not at all.
>
> There are some (test) HBase deployments I know of that go pretty
> vertical, with multiple TBs of disk on each node, and that therefore want
> a high number of regions per region server to match that density. That
> may meet with operational success, but it is architecturally suspect. I
> ran a test cluster once with > 1,000 regions per server on 25 servers, in
> the 0.19 timeframe. 0.20 is much better in terms of resource demand
> (less) and liveness (enormously improved), but I still wouldn't recommend
> it, unless your clients can wait for up to several minutes on blocked
> reads and writes to affected regions should a node go down. With that
> many regions per server, it stands to reason that just about every client
> would be affected.
>
> The numbers I have for Google's canonical BigTable deployment are several
> years out of date but they go pretty far in the other direction -- about
> 100 regions per server is the target.
>
> I think it also depends on whether you intend to colocate TaskTrackers
> with the region servers. I presume you intend to run HBase region servers
> colocated with HDFS DataNodes. After you have an HBase cluster up for
> some number of hours, certainly ~24, background compaction will generally
> bring the HDFS blocks backing region data local to the server. MapReduce
> tasks backed by HBase tables will then see data-locality advantages
> similar to those you are probably accustomed to when working with files
> in HDFS. If you mix storage and computation this way, it makes sense to
> seek a balance between the amount of data stored on each node (the number
> of regions being served) and the available computational resources
> (available CPU cores, and time constraints, if any, on task execution).
>
> Even if you don't intend to do the above, it's possible that an overly
> high region density can negatively impact performance if too much I/O
> load is placed on average on each region server. Adding more servers to
> spread load would then likely help**.
>
> These considerations bias against hosting a very large number of regions
> per region server.
>
>   - Andy
>
> **: I say likely because this presumes query and edit patterns have been
> guided as necessary through engineering to be widely distributed in the