HBase >> mail # dev >> commit semantics


RE: commit semantics
Btw, is there much gain in having a large number of regions (i.e., on the order of 500) per region server?

I understand that having multiple regions per region server allows finer-grained rebalancing when new nodes are added or a node goes down. But would having a smaller number of regions per region server (say ~50) be really bad? If a region server goes down, 50 other nodes would each pick up ~1/50 of its work. That is not as good as 500 other nodes picking up 1/500 of its work each, but it still seems acceptable. Are there other advantages to having a large number of regions per region server?

regards,
Kannan
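
The fan-out argument above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every region carries equal load and that each region of the failed server is reassigned to a different surviving server whenever enough servers exist. The function name and the cluster size of 1000 are illustrative assumptions, not anything taken from HBase.

```python
from fractions import Fraction


def failover_share(regions_per_server, surviving_servers):
    """Return (recipients, share) after one region server fails.

    Assumes equal-sized regions, each reassigned to a distinct surviving
    server where possible (an assumption for illustration only).
    """
    recipients = min(regions_per_server, surviving_servers)
    # Each recipient absorbs an equal slice of the failed server's work.
    return recipients, Fraction(1, recipients)


for regions in (50, 500):
    recipients, share = failover_share(regions, surviving_servers=1000)
    print(f"{regions} regions: {recipients} servers each absorb {share} of the load")
```

Under these assumptions the extra load per recipient only shrinks from 2% to 0.2% of one server's work, which is the trade-off the question is weighing.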
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
To: [EMAIL PROTECTED]
Subject: Re: commit semantics

wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of open files: if you have 1000 region
servers * 500 regions and each region has its own HLog, then you may
have 500 000 HLogs to manage. Also you can have more than one file per
HLog, so let's say you have on average 5 log files per HLog; that's
2 500 000 files on HDFS.

J-D
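
The file-count estimate above is plain multiplication, sketched here with the inputs quoted in the message (1000 region servers, 500 regions per server, 5 log files per HLog). The function name is hypothetical, and the comparison against one HLog per region server is added only for contrast.

```python
def hlog_file_estimate(region_servers, hlogs_per_server, files_per_hlog):
    """Return (hlogs, files_on_hdfs) for a given log layout."""
    hlogs = region_servers * hlogs_per_server
    return hlogs, hlogs * files_per_hlog


# One HLog per region: 500 regions on each of 1000 servers.
per_region = hlog_file_estimate(1000, 500, 5)
# One HLog per region server, as described in the thread.
per_server = hlog_file_estimate(1000, 1, 5)

print("per-region HLogs:", per_region)
print("per-server HLogs:", per_server)
```

With the same inputs, a single log per region server keeps the count of log files in the thousands rather than the millions.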

On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote:
> Hi Ryan,
>
> Thanks for your response.
>
>>Right now each regionserver has 1 log, so if 2 puts on different
>>tables hit the same RS, they hit the same HLog.
>
> I understand. My point was that the application could insert the same record
> into two different tables on two different HBase instances on two different
> pieces of hardware.
>
> On a related note, can somebody explain what the tradeoff is if each region
> has its own HLog? Are you worried about the number of files in HDFS, or
> maybe the number of sync threads in the region server? Can multiple HLog
> files provide faster region splits?
>
>
>> I've thought about this issue quite a bit, and I think the sync every
>> 1 rows combined with optional no-sync and low time sync() is the way
>> to go. If you want to discuss this more in person, maybe we can meet
>> up for brews or something.
>>
>
> The group-commit thing I can understand. HDFS does a very similar thing. But
> can you explain your alternative "sync every 1 rows combined with optional
> no-sync and low time sync"? For those applications that have the natural
> characteristics of updating only one row per logical operation, how can they
> be sure that their data has reached some-sort-of-stable-storage unless they
> sync after every row update?
>
> thanks,
> dhruba
>
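
The trade-off between syncing after every row and group commit, discussed above, can be illustrated with a toy log that counts sync calls instead of touching a disk. This is a single-threaded sketch under stated assumptions, not HBase's or HDFS's implementation: `SketchLog`, `write_sync_every_row`, `write_group_commit`, and `group_size` are all hypothetical names. A real group commit would batch edits from concurrent writers and acknowledge each writer only after the sync that covers its edit.

```python
class SketchLog:
    """Toy write-ahead log: counts syncs instead of touching a disk."""

    def __init__(self):
        self.buffered = []   # appended but not yet synced
        self.durable = []    # treated as safely stored after sync()
        self.sync_calls = 0

    def append(self, edit):
        self.buffered.append(edit)

    def sync(self):
        # Everything buffered so far becomes durable in one call.
        self.durable.extend(self.buffered)
        self.buffered.clear()
        self.sync_calls += 1


def write_sync_every_row(log, edits):
    """Sync after each edit: every row is durable before the next starts."""
    for edit in edits:
        log.append(edit)
        log.sync()


def write_group_commit(log, edits, group_size):
    """Sync once per group; rows are durable only after the group's sync."""
    for count, edit in enumerate(edits, start=1):
        log.append(edit)
        if count % group_size == 0:
            log.sync()
    if log.buffered:  # flush the final partial group
        log.sync()


edits = [f"row-{n}" for n in range(10)]
every_row = SketchLog()
write_sync_every_row(every_row, edits)
grouped = SketchLog()
write_group_commit(grouped, edits, group_size=4)
print(every_row.sync_calls, grouped.sync_calls)
```

Both runs end with all ten rows durable. The grouped run issues fewer syncs, at the cost that a single-row update cannot be acknowledged as stable until its group's sync has completed.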