|
Joydeep Sarma
2010-01-11, 23:46
Ryan Rawson
2010-01-11, 23:58
Jean-Daniel Cryans
2010-01-12, 00:03
Joydeep Sarma
2010-01-12, 04:12
Jean-Daniel Cryans
2010-01-12, 04:48
Dhruba Borthakur
2010-01-12, 06:25
Ryan Rawson
2010-01-12, 06:53
Dhruba Borthakur
2010-01-12, 08:24
Ryan Rawson
2010-01-12, 08:39
Jean-Daniel Cryans
2010-01-12, 17:41
stack
2010-01-12, 17:58
Dhruba Borthakur
2010-01-12, 18:14
stack
2010-01-12, 18:51
Kannan Muthukkaruppan
2010-01-12, 19:29
Kannan Muthukkaruppan
2010-01-12, 19:40
Jean-Daniel Cryans
2010-01-12, 19:43
Jean-Daniel Cryans
2010-01-12, 19:53
Kannan Muthukkaruppan
2010-01-12, 20:10
Andrew Purtell
2010-01-12, 20:49
Kannan Muthukkaruppan
2010-01-12, 21:07
Jean-Daniel Cryans
2010-01-12, 21:36
stack
2010-01-13, 01:23
stack
2010-01-13, 05:12
stack
2010-01-13, 05:16
Joydeep Sarma
2010-01-13, 05:41
Dhruba Borthakur
2010-01-13, 16:51
Jean-Daniel Cryans
2010-01-13, 17:56
Dhruba Borthakur
2010-01-13, 18:38
Jean-Daniel Cryans
2010-01-13, 18:40
Dhruba Borthakur
2010-01-13, 18:43
Joydeep Sarma
2010-01-13, 19:01
Jean-Daniel Cryans
2010-01-13, 22:56
|
-
commit semanticsJoydeep Sarma 2010-01-11, 23:46
Hey HBase-devs,
we have been going through hbase code to come up to speed. One of the questions was regarding the commit semantics. Thumbing through the RegionServer code that's appending to the wal: syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() and the log writer thread calls: hflush(), syncDone.signalAll() however hflush doesn't necessarily call a sync on the underlying log file: if (this.forceSync || this.unflushedEntries.get() >= this.flushlogentries) { ... sync() ... } so it seems that if forceSync is not true, the syncWal can unblock before a sync is called (and forcesync seems to be only true for metaregion()). are we missing something - or is there a bug here (the signalAll should be conditional on hflush having actually flushed something). thanks, Joydeep
-
Re: commit semanticsRyan Rawson 2010-01-11, 23:58
Performance.... It's all about performance.
In my own tests, calling sync() in HDFS-0.21 on every single commit can limit the number of small rows you do to about a max of 1200 a second. One way to speed things up is to sync less often. Another way is to sync on a timer instead. Both of these are going to be way more important in HDFS-0.21/Hbase-0.21. If we are talking about hdfs/hadoop 0.20, it hardly matters either way, there is that whole 'no append/sync' thing you know all about. -ryan On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > Hey HBase-devs, > > we have been going through hbase code to come up to speed. > > One of the questions was regarding the commit semantics. Thumbing through > the RegionServer code that's appending to the wal: > > syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() > > and the log writer thread calls: > > hflush(), syncDone.signalAll() > > however hflush doesn't necessarily call a sync on the underlying log file: > > if (this.forceSync || > this.unflushedEntries.get() >= this.flushlogentries) { ... sync() > ... } > > so it seems that if forceSync is not true, the syncWal can unblock before a > sync is called (and forcesync seems to be only true for metaregion()). > > are we missing something - or is there a bug here (the signalAll should be > conditional on hflush having actually flushed something). > > thanks, > > Joydeep >
-
Re: commit semanticsJean-Daniel Cryans 2010-01-12, 00:03
Hey Joydeep,
This is actually intended this way but the name of the variable is misleading. The sync is done only if forceSync or we have enough entries to sync (default is 1). If someone wants to sync only 100 entries for example, they would play with that configuration. Hope that helps, J-D On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > Hey HBase-devs, > > we have been going through hbase code to come up to speed. > > One of the questions was regarding the commit semantics. Thumbing through > the RegionServer code that's appending to the wal: > > syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() > > and the log writer thread calls: > > hflush(), syncDone.signalAll() > > however hflush doesn't necessarily call a sync on the underlying log file: > > if (this.forceSync || > this.unflushedEntries.get() >= this.flushlogentries) { ... sync() > ... } > > so it seems that if forceSync is not true, the syncWal can unblock before a > sync is called (and forcesync seems to be only true for metaregion()). > > are we missing something - or is there a bug here (the signalAll should be > conditional on hflush having actually flushed something). > > thanks, > > Joydeep >
-
Re: commit semanticsJoydeep Sarma 2010-01-12, 04:12
ok - hadn't thought about it that way - but yeah with a default of 1 -
the semantics seem correct. under high load - some batching would automatically happen at this setting (or so one would think - not sure if hdfs appends are blocked on pending syncs (in which case the batching wouldn't quite happen i think) - cc'ing Dhruba). if the performance with setting of 1 doesn't work out - we may need an option to delay acks until actual syncs .. (most likely we would be able to compromise on latency to get higher throughput - but wouldn't be willing to compromise on data integrity) > Hey Joydeep, > > This is actually intended this way but the name of the variable is > misleading. The sync is done only if forceSync or we have enough > entries to sync (default is 1). If someone wants to sync only 100 > entries for example, they would play with that configuration. > > Hope that helps, > > J-D > > > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: >> >> Hey HBase-devs, >> >> we have been going through hbase code to come up to speed. >> >> One of the questions was regarding the commit semantics. Thumbing through the RegionServer code that's appending to the wal: >> >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() >> >> and the log writer thread calls: >> >> hflush(), syncDone.signalAll() >> >> however hflush doesn't necessarily call a sync on the underlying log file: >> >> if (this.forceSync || >> this.unflushedEntries.get() >= this.flushlogentries) { ... sync() ... } >> >> so it seems that if forceSync is not true, the syncWal can unblock before a sync is called (and forcesync seems to be only true for metaregion()). >> >> are we missing something - or is there a bug here (the signalAll should be conditional on hflush having actually flushed something). >> >> thanks, >> >> Joydeep >
-
Re: commit semanticsJean-Daniel Cryans 2010-01-12, 04:48
Inline.
J-D On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > ok - hadn't thought about it that way - but yeah with a default of 1 - > the semantics seem correct. > > under high load - some batching would automatically happen at this > setting (or so one would think - not sure if hdfs appends are blocked > on pending syncs (in which case the batching wouldn't quite happen i > think) - cc'ing Dhruba). Yes this is our version of group commit. > > if the performance with setting of 1 doesn't work out - we may need an > option to delay acks until actual syncs .. (most likely we would be > able to compromise on latency to get higher throughput - but wouldn't > be willing to compromise on data integrity) Good idea, we don't currently support that feature although we have the opposite running by default which is deferred log flush. Tables are never sync'ed and they rely on the LogSyncer thread awaitNanos' timeout (configurable) or tables that are highly durable. In our opinion, a cluster with a healthy mix of deferred and non-deferred tables still guarantees a very high level of durability for the default setting.
-
Re: commit semanticsDhruba Borthakur 2010-01-12, 06:25
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands. if we want the best of both worlds.. latency as well as data integrity, how about inserting the same record into two completely separate HBase tables in parallel... the operation can complete as soon as the record is inserted into the first HBase table (thus giving low latencies) but data integrity will not be compromised because it is unlikely that two region servers will fail exactly at the same time (assuming that there is a way to ensure that these two tables are not handled by the same region server). thanks, dhruba On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > ok - hadn't thought about it that way - but yeah with a default of 1 - > the semantics seem correct. > > under high load - some batching would automatically happen at this > setting (or so one would think - not sure if hdfs appends are blocked > on pending syncs (in which case the batching wouldn't quite happen i > think) - cc'ing Dhruba). > > if the performance with setting of 1 doesn't work out - we may need an > option to delay acks until actual syncs .. (most likely we would be > able to compromise on latency to get higher throughput - but wouldn't > be willing to compromise on data integrity) > > > Hey Joydeep, > > > > This is actually intended this way but the name of the variable is > > misleading. The sync is done only if forceSync or we have enough > > entries to sync (default is 1). If someone wants to sync only 100 > > entries for example, they would play with that configuration. > > > > Hope that helps, > > > > J-D > > > > > > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> > wrote: > >> > >> Hey HBase-devs, > >> > >> we have been going through hbase code to come up to speed. > >> > >> One of the questions was regarding the commit semantics. Thumbing > through the RegionServer code that's appending to the wal: > >> > >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() > >> > >> and the log writer thread calls: > >> > >> hflush(), syncDone.signalAll() > >> > >> however hflush doesn't necessarily call a sync on the underlying log > file: > >> > >> if (this.forceSync || > >> this.unflushedEntries.get() >= this.flushlogentries) { ... > sync() ... } > >> > >> so it seems that if forceSync is not true, the syncWal can unblock > before a sync is called (and forcesync seems to be only true for > metaregion()). > >> > >> are we missing something - or is there a bug here (the signalAll should > be conditional on hflush having actually flushed something). > >> > >> thanks, > >> > >> Joydeep > > > -- Connect to me at http://www.facebook.com/dhruba
-
Re: commit semanticsRyan Rawson 2010-01-12, 06:53
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog. There are 2 performance enhancing things in trunk: - bulk commit - we only call sync() once per RPC, no matter how many rows are involved. If you use the batch put API you can get really high levels of performance. - group commit - we can take multiple thread's worth of sync()s and do it in one, not N. This improves performance while maintaining high data security. If you are expecting very high concurrency, group commit is your friend. The more concurrent operations, the more rows per sync you are capturing and the higher overall rows/sec performance you can achieve while the same number of sync() calls/sec performance remains constant. The other option is to sync() on a fine grained timer, eg: every 10ms (or at 100hz). The window of data loss is small, and the performance boost is substantial. I asked JD to implement a switchable config so that you can chose on a table-by-table basis the right mix of performance vs persistence with a better control feature than merely "sync every N rows". I've thought about this issue quite a bit, and I think the sync every 1 rows combined with optional no-sync and low time sync() is the way to go. If you want to discuss this more in person, maybe we can meet up for brews or something. -ryan On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a > pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands. > > if we want the best of both worlds.. latency as well as data integrity, how > about inserting the same record into two completely separate HBase tables in > parallel... the operation can complete as soon as the record is inserted > into the first HBase table (thus giving low latencies) but data integrity > will not be compromised because it is unlikely that two region servers will > fail exactly at the same time (assuming that there is a way to ensure that > these two tables are not handled by the same region server). > > thanks, > dhruba > > > On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > >> ok - hadn't thought about it that way - but yeah with a default of 1 - >> the semantics seem correct. >> >> under high load - some batching would automatically happen at this >> setting (or so one would think - not sure if hdfs appends are blocked >> on pending syncs (in which case the batching wouldn't quite happen i >> think) - cc'ing Dhruba). >> >> if the performance with setting of 1 doesn't work out - we may need an >> option to delay acks until actual syncs .. (most likely we would be >> able to compromise on latency to get higher throughput - but wouldn't >> be willing to compromise on data integrity) >> >> > Hey Joydeep, >> > >> > This is actually intended this way but the name of the variable is >> > misleading. The sync is done only if forceSync or we have enough >> > entries to sync (default is 1). If someone wants to sync only 100 >> > entries for example, they would play with that configuration. >> > >> > Hope that helps, >> > >> > J-D >> > >> > >> > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> >> wrote: >> >> >> >> Hey HBase-devs, >> >> >> >> we have been going through hbase code to come up to speed. >> >> >> >> One of the questions was regarding the commit semantics. Thumbing >> through the RegionServer code that's appending to the wal: >> >> >> >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() >> >> >> >> and the log writer thread calls: >> >> >> >> hflush(), syncDone.signalAll() >> >> >> >> however hflush doesn't necessarily call a sync on the underlying log >> file: >> >> >> >> if (this.forceSync || >> >> this.unflushedEntries.get() >= this.flushlogentries) { ... >> sync() ... } >> >> >> >> so it seems that if forceSync is not true, the syncWal can unblock >> before a sync is called (and forcesync seems to be only true for
-
Re: commit semanticsDhruba Borthakur 2010-01-12, 08:24
Hi Ryan,
thanks for ur response. >Right now each regionserver has 1 log, so if 2 puts on different >tables hit the same RS, they hit the same HLog. I understand. My point was that the application could insert the same record into two different tables on two different Hbase instances on two different piece of hardware. On a related note, can somebody explain what the tradeoff is if each region has its own hlog? are you worried about the number of files in HDFS? or maybe the number of sync-threads in the region server? Can multiple hlog files provide faster region splits? > I've thought about this issue quite a bit, and I think the sync every > 1 rows combined with optional no-sync and low time sync() is the way > to go. If you want to discuss this more in person, maybe we can meet > up for brews or something. > The group-commit thing I can understand. HDFS does a very similar thing. But can you explain your alternative "sync every 1 rows combined with optional no-sync and low time sync"? For those applications that have the natural characteristics of updating only one row per logical operation, how can they be sure that their data has reached some-sort-of-stable-storage unless they sync after every row update? thanks, dhruba
-
Re: commit semanticsRyan Rawson 2010-01-12, 08:39
On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote:
> Hi Ryan, > > thanks for ur response. > >>Right now each regionserver has 1 log, so if 2 puts on different >>tables hit the same RS, they hit the same HLog. > > I understand. My point was that the application could insert the same record > into two different tables on two different Hbase instances on two different > piece of hardware. Ah yes, of course, I thought you meant 2 tables in the same cluster. > > On a related note, can somebody explain what the tradeoff is if each region > has its own hlog? are you worried about the number of files in HDFS? or > maybe the number of sync-threads in the region server? Can multiple hlog > files provide faster region splits? So each hlog needs to be treated as a stream of edits for log recovery. So adding more logs, requires the code to still treat the pool as 1 log and keep an overall ordering across all logs as a merged set. It just adds complexity, and I'd like to put it off as long as possible. Initially when I was worried about performance issues, adding a pool only extended the performance by a linear amount, and I was looking for substantially more than that. > > >> I've thought about this issue quite a bit, and I think the sync every >> 1 rows combined with optional no-sync and low time sync() is the way >> to go. If you want to discuss this more in person, maybe we can meet >> up for brews or something. >> > > The group-commit thing I can understand. HDFS does a very similar thing. But > can you explain your alternative "sync every 1 rows combined with optional > no-sync and low time sync"? For those applications that have the natural > characteristics of updating only one row per logical operation, how can they > be sure that their data has reached some-sort-of-stable-storage unless they > sync after every row update? Normally this would be the case, but consider the case of the call 'incrementColumnValue' which maintains a counter essentially. Losing some edits means losing counter values - if we we are talking about a counter that is incremented 100m times a day, then speed is more important than potentially losing some extremely small number of updates when a server crashes. -ryan > > thanks, > dhruba >
-
Re: commit semanticsJean-Daniel Cryans 2010-01-12, 17:41
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000 region servers * 500 regions then you may have 100 000 HLogs to manage. Also you can have more than one file per HLog, so let's say you have on average 5 log files per HLog that's 500 000 files on HDFS. J-D On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: > Hi Ryan, > > thanks for ur response. > >>Right now each regionserver has 1 log, so if 2 puts on different >>tables hit the same RS, they hit the same HLog. > > I understand. My point was that the application could insert the same record > into two different tables on two different Hbase instances on two different > piece of hardware. > > On a related note, can somebody explain what the tradeoff is if each region > has its own hlog? are you worried about the number of files in HDFS? or > maybe the number of sync-threads in the region server? Can multiple hlog > files provide faster region splits? > > >> I've thought about this issue quite a bit, and I think the sync every >> 1 rows combined with optional no-sync and low time sync() is the way >> to go. If you want to discuss this more in person, maybe we can meet >> up for brews or something. >> > > The group-commit thing I can understand. HDFS does a very similar thing. But > can you explain your alternative "sync every 1 rows combined with optional > no-sync and low time sync"? For those applications that have the natural > characteristics of updating only one row per logical operation, how can they > be sure that their data has reached some-sort-of-stable-storage unless they > sync after every row update? > > thanks, > dhruba >
-
Re: commit semanticsstack 2010-01-12, 17:58
On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote:
> if we want the best of both worlds.. latency as well as data integrity, how > about inserting the same record into two completely separate HBase tables > in > parallel... the operation can complete as soon as the record is inserted > into the first HBase table (thus giving low latencies) Return after insert into the first table? Then internally hbase is meant to take care of the insert into the second table? What if the latter fails for some reason other than regionserver crash? The two writes would have to be done as hdfs does, in series, if the two tables are to remain in sync, with the addition of a roll back of the transaction if insert does not go through to both tables since we don't have something like the hdfs background thread ensuring replica counts. > but data integrity > will not be compromised because it is unlikely that two region servers will > fail exactly at the same time (assuming that there is a way to ensure that > these two tables are not handled by the same region server). > How do you suggest the application deal with reading from these two tables? If they are guaranteed in-sync, then it could pick either. If the two can wander, then the application needs to read from both and make reconciliation somehow? Just trying to understand what you are suggesting Dhruba, St.Ack > > thanks, > dhruba > > > On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > > > ok - hadn't thought about it that way - but yeah with a default of 1 - > > the semantics seem correct. > > > > under high load - some batching would automatically happen at this > > setting (or so one would think - not sure if hdfs appends are blocked > > on pending syncs (in which case the batching wouldn't quite happen i > > think) - cc'ing Dhruba). > > > > if the performance with setting of 1 doesn't work out - we may need an > > option to delay acks until actual syncs .. (most likely we would be > > able to compromise on latency to get higher throughput - but wouldn't > > be willing to compromise on data integrity) > > > > > Hey Joydeep, > > > > > > This is actually intended this way but the name of the variable is > > > misleading. The sync is done only if forceSync or we have enough > > > entries to sync (default is 1). If someone wants to sync only 100 > > > entries for example, they would play with that configuration. > > > > > > Hope that helps, > > > > > > J-D > > > > > > > > > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> > > wrote: > > >> > > >> Hey HBase-devs, > > >> > > >> we have been going through hbase code to come up to speed. > > >> > > >> One of the questions was regarding the commit semantics. Thumbing > > through the RegionServer code that's appending to the wal: > > >> > > >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() > > >> > > >> and the log writer thread calls: > > >> > > >> hflush(), syncDone.signalAll() > > >> > > >> however hflush doesn't necessarily call a sync on the underlying log > > file: > > >> > > >> if (this.forceSync || > > >> this.unflushedEntries.get() >= this.flushlogentries) { ... > > sync() ... } > > >> > > >> so it seems that if forceSync is not true, the syncWal can unblock > > before a sync is called (and forcesync seems to be only true for > > metaregion()). > > >> > > >> are we missing something - or is there a bug here (the signalAll > should > > be conditional on hflush having actually flushed something). > > >> > > >> thanks, > > >> > > >> Joydeep > > > > > > > > > -- > Connect to me at http://www.facebook.com/dhruba >
-
Re: commit semanticsDhruba Borthakur 2010-01-12, 18:14
Hi stack,
I was meaning "what if the application inserted the same record into two Hbase instances"? Of course, now the onus is on the appl to keep both of them in sync and recover from any inconsistencies between them. thanks, dhruba On Tue, Jan 12, 2010 at 9:58 AM, stack <[EMAIL PROTECTED]> wrote: > On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur <[EMAIL PROTECTED]> > wrote: > > > if we want the best of both worlds.. latency as well as data integrity, > how > > about inserting the same record into two completely separate HBase tables > > in > > parallel... the operation can complete as soon as the record is inserted > > into the first HBase table (thus giving low latencies) > > > Return after insert into the first table? Then internally hbase is meant > to > take care of the insert into the second table? What if the latter fails > for > some reason other than regionserver crash? > > The two writes would have to be done as hdfs does, in series, if the two > tables are to remain in sync, with the addition of a roll back of the > transaction if insert does not go through to both tables since we don't > have > something like the hdfs background thread ensuring replica counts. > > > > but data integrity > > will not be compromised because it is unlikely that two region servers > will > > fail exactly at the same time (assuming that there is a way to ensure > that > > these two tables are not handled by the same region server). > > > > How do you suggest the application deal with reading from these two tables? > If they are guaranteed in-sync, then it could pick either. If the two can > wander, then the application needs to read from both and make > reconciliation > somehow? > > Just trying to understand what you are suggesting Dhruba, > St.Ack > > > > > > > thanks, > > dhruba > > > > > > On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma <[EMAIL PROTECTED]> > wrote: > > > > > ok - hadn't thought about it that way - but yeah with a default of 1 - > > > the semantics seem correct. > > > > > > under high load - some batching would automatically happen at this > > > setting (or so one would think - not sure if hdfs appends are blocked > > > on pending syncs (in which case the batching wouldn't quite happen i > > > think) - cc'ing Dhruba). > > > > > > if the performance with setting of 1 doesn't work out - we may need an > > > option to delay acks until actual syncs .. (most likely we would be > > > able to compromise on latency to get higher throughput - but wouldn't > > > be willing to compromise on data integrity) > > > > > > > Hey Joydeep, > > > > > > > > This is actually intended this way but the name of the variable is > > > > misleading. The sync is done only if forceSync or we have enough > > > > entries to sync (default is 1). If someone wants to sync only 100 > > > > entries for example, they would play with that configuration. > > > > > > > > Hope that helps, > > > > > > > > J-D > > > > > > > > > > > > On Mon, Jan 11, 2010 at 3:46 PM, Joydeep Sarma <[EMAIL PROTECTED]> > > > wrote: > > > >> > > > >> Hey HBase-devs, > > > >> > > > >> we have been going through hbase code to come up to speed. > > > >> > > > >> One of the questions was regarding the commit semantics. Thumbing > > > through the RegionServer code that's appending to the wal: > > > >> > > > >> syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await() > > > >> > > > >> and the log writer thread calls: > > > >> > > > >> hflush(), syncDone.signalAll() > > > >> > > > >> however hflush doesn't necessarily call a sync on the underlying log > > > file: > > > >> > > > >> if (this.forceSync || > > > >> this.unflushedEntries.get() >= this.flushlogentries) { ... > > > sync() ... } > > > >> > > > >> so it seems that if forceSync is not true, the syncWal can unblock > > > before a sync is called (and forcesync seems to be only true for > > > metaregion()). > > > >> > > > >> are we missing something - or is there a bug here (the signalAll > > should > > > be conditional on hflush having actually flushed something). Connect to me at http://www.facebook.com/dhruba
-
Re: commit semanticsstack 2010-01-12, 18:51
On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote:
> Hi stack, > > I was meaning "what if the application inserted the same record into two > Hbase instances"? Of course, now the onus is on the appl to keep both of > them in sync and recover from any inconsistencies between them. > > Ok. Like your "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf? I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above. It could write in parallel but the second thread might not complete for myriad reasons. What then? And as you say, reading, the client would have to make reconciliation. Isn't there already a 'scalable database' that gives you this headache for free without your having to do work on your part (smile)? Do you think there a problem syncing on every write (with some batching of writes happening when high-concurrency) or, if that too slow for your needs, adding the holding of clients until sync happens as joydeep suggests? Will that be sufficient data integrity-wise? St.Ack Thanks, St.Ack
-
RE: commit semanticsKannan Muthukkaruppan 2010-01-12, 19:29
Dhruba & I just talked off-line about this as well. Yes, writing to two clusters would result in unnecessary complexity... we will essentially need to deal with inconsistencies between the two clusters at the application level.
For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows: * We wait on acknowledging back to the client until the transaction has been synced to HDFS. * Syncs are batched-a sync is called if the queue has enough transactions or if a timer expires. (I would imagine that both the # of transactions to batch up as well as timer are configurable knobs already)? In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself. From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of stack Sent: Tuesday, January 12, 2010 10:52 AM To: [EMAIL PROTECTED] Cc: Kannan Muthukkaruppan; Dhruba Borthakur Subject: Re: commit semantics On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: Hi stack, I was meaning "what if the application inserted the same record into two Hbase instances"? Of course, now the onus is on the appl to keep both of them in sync and recover from any inconsistencies between them. Ok. Like your "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf? I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above. It could write in parallel but the second thread might not complete for myriad reasons. What then? And as you say, reading, the client would have to make reconciliation. Isn't there already a 'scalable database' that gives you this headache for free without your having to do work on your part (smile)? Do you think there a problem syncing on every write (with some batching of writes happening when high-concurrency) or, if that too slow for your needs, adding the holding of clients until sync happens as joydeep suggests? Will that be sufficient data integrity-wise? St.Ack Thanks, St.Ack
-
RE: commit semanticsKannan Muthukkaruppan 2010-01-12, 19:40
Btw, is there much gains in having a large number of regions-- i.e. to the tune of 500 -- per region server?
I understand that having multiple regions per region server allows finer grained rebalancing when new nodes are added or a node goes down. But would say having a smaller number of regions per region server (say ~50) be really bad. If a region server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as 500 other nodes picking up 1/500 of its work each-- but seems acceptable still. Are there other advantages of having a large number of regions per region server? regards, Kannan -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jean-Daniel Cryans Sent: Tuesday, January 12, 2010 9:42 AM To: [EMAIL PROTECTED] Subject: Re: commit semantics wrt 1 HLog per region server, this is from the Bigtable paper. Their main concern is the number of opened files since if you have 1000 region servers * 500 regions then you may have 100 000 HLogs to manage. Also you can have more than one file per HLog, so let's say you have on average 5 log files per HLog that's 500 000 files on HDFS. J-D On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: > Hi Ryan, > > thanks for ur response. > >>Right now each regionserver has 1 log, so if 2 puts on different >>tables hit the same RS, they hit the same HLog. > > I understand. My point was that the application could insert the same record > into two different tables on two different Hbase instances on two different > piece of hardware. > > On a related note, can somebody explain what the tradeoff is if each region > has its own hlog? are you worried about the number of files in HDFS? or > maybe the number of sync-threads in the region server? Can multiple hlog > files provide faster region splits? > > >> I've thought about this issue quite a bit, and I think the sync every >> 1 rows combined with optional no-sync and low time sync() is the way >> to go. If you want to discuss this more in person, maybe we can meet >> up for brews or something. >> > > The group-commit thing I can understand. HDFS does a very similar thing. But > can you explain your alternative "sync every 1 rows combined with optional > no-sync and low time sync"? For those applications that have the natural > characteristics of updating only one row per logical operation, how can they > be sure that their data has reached some-sort-of-stable-storage unless they > sync after every row update? > > thanks, > dhruba >
-
Re: commit semanticsJean-Daniel Cryans 2010-01-12, 19:43
On Tue, Jan 12, 2010 at 11:29 AM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]> wrote: > > For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows: > > * We wait on acknowledging back to the client until the transaction has been synced to HDFS. Yes > > * Syncs are batched-a sync is called if the queue has enough transactions or if a timer expires. (I would imagine that both the # of transactions to batch up as well as timer are configurable knobs already)? In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself. Nope. There is two kinds of group commit around that piece of code: 1) What you called batch commit, which is a configurable value (flushlogentries) that we have to append x amount of entries to trigger a sync. Clients don't hold until that syncs happens so a region server failure could lose some rows depending on the time between the last sync and the failure. If flushlogentries=100 and 99 entries are lying around for more than the timer's timeout (default 1 sec), the timer will force sync those entries. 2) Group commit happens at high concurrency and is only useful if a high number of clients are writing at the same time and that flushlogentries=1. What happens in the LogSyncer thread is that instead of calling sync() for every entry, we "group" the clients waiting on the previous sync and issue only 1 sync for all of them. In this case, when the call returns in the client, we are sure that the value is in HDFS. > > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of stack > Sent: Tuesday, January 12, 2010 10:52 AM > To: [EMAIL PROTECTED] > Cc: Kannan Muthukkaruppan; Dhruba Borthakur > Subject: Re: commit semantics > > On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: > Hi stack, > > I was meaning "what if the application inserted the same record into two > Hbase instances"? Of course, now the onus is on the appl to keep both of > them in sync and recover from any inconsistencies between them. > > Ok. Like your "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf? > > I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above. It could write in parallel but the second thread might not complete for myriad reasons. What then? And as you say, reading, the client would have to make reconciliation. > > Isn't there already a 'scalable database' that gives you this headache for free without your having to do work on your part (smile)? > > Do you think there a problem syncing on every write (with some batching of writes happening when high-concurrency) or, if that too slow for your needs, adding the holding of clients until sync happens as joydeep suggests? Will that be sufficient data integrity-wise? > > St.Ack > > Thanks, > St.Ack >
-
Re: commit semanticsJean-Daniel Cryans 2010-01-12, 19:53
It's all very depending on the size of your data VS the size of your
cluster VS your usage pattern. Example: you have 50 regions on a RS and they are all filled at the same rate. The RS dies so the master has to split the logs of 50 regions before reassigning. Example2: you have 500 regions on a RS and only 1 is filled. When it dies, the master will only have 1 region to process. Since a planned optimization is to reassign regions that have no edits in any HLog (you have to have that knowledge prior to processing the files, maybe store that in zookeeper) right before log splitting, then you lose availability on 49 regions in this case. Nevertheless, splitting a small number of regions should be more efficient. Also more regions in general means more memory usage, possibly more opened files, and if your data should be served very fast, then a higher number of regions means more data to keep in memory. J-D On Tue, Jan 12, 2010 at 11:40 AM, Kannan Muthukkaruppan <[EMAIL PROTECTED]> wrote: > Btw, is there much gains in having a large number of regions-- i.e. to the tune of 500 -- per region server? > > I understand that having multiple regions per region server allows finer grained rebalancing when new nodes are added or a node goes down. But would say having a smaller number of regions per region server (say ~50) be really bad. If a region server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as 500 other nodes picking up 1/500 of its work each-- but seems acceptable still. Are there other advantages of having a large number of regions per region server? > > regards, > Kannan > -----Original Message----- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jean-Daniel Cryans > Sent: Tuesday, January 12, 2010 9:42 AM > To: [EMAIL PROTECTED] > Subject: Re: commit semantics > > wrt 1 HLog per region server, this is from the Bigtable paper. Their > main concern is the number of opened files since if you have 1000 > region servers * 500 regions then you may have 100 000 HLogs to > manage. Also you can have more than one file per HLog, so let's say > you have on average 5 log files per HLog that's 500 000 files on HDFS. > > J-D > > On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: >> Hi Ryan, >> >> thanks for ur response. >> >>>Right now each regionserver has 1 log, so if 2 puts on different >>>tables hit the same RS, they hit the same HLog. >> >> I understand. My point was that the application could insert the same record >> into two different tables on two different Hbase instances on two different >> piece of hardware. >> >> On a related note, can somebody explain what the tradeoff is if each region >> has its own hlog? are you worried about the number of files in HDFS? or >> maybe the number of sync-threads in the region server? Can multiple hlog >> files provide faster region splits? >> >> >>> I've thought about this issue quite a bit, and I think the sync every >>> 1 rows combined with optional no-sync and low time sync() is the way >>> to go. If you want to discuss this more in person, maybe we can meet >>> up for brews or something. >>> >> >> The group-commit thing I can understand. HDFS does a very similar thing. But >> can you explain your alternative "sync every 1 rows combined with optional >> no-sync and low time sync"? For those applications that have the natural >> characteristics of updating only one row per logical operation, how can they >> be sure that their data has reached some-sort-of-stable-storage unless they >> sync after every row update? >> >> thanks, >> dhruba >> >
-
RE: commit semanticsKannan Muthukkaruppan 2010-01-12, 20:10
Ok cool. Thanks for clarifying.
I think what I had in mind was a hybrid-- basically try to accumulate transactions up to a certain app configurable time window before sync'ing (& until the sync delay ack'ing the client). Just caught up on a earlier response from Joy on this as well. <<if the performance with setting of 1 doesn't work out - we may need an option to delay acks until actual syncs .. (most likely we would be able to compromise on latency to get higher throughput - but wouldn't be willing to compromise on data integrity)>> Yes, that's what I had in mind. Agree that this could be something we explore later if necessary. Regards, Kannan -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jean-Daniel Cryans Sent: Tuesday, January 12, 2010 11:44 AM To: [EMAIL PROTECTED] Subject: Re: commit semantics On Tue, Jan 12, 2010 at 11:29 AM, Kannan Muthukkaruppan <[EMAIL PROTECTED]> wrote: > > For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows: > > * We wait on acknowledging back to the client until the transaction has been synced to HDFS. Yes > > * Syncs are batched-a sync is called if the queue has enough transactions or if a timer expires. (I would imagine that both the # of transactions to batch up as well as timer are configurable knobs already)? In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself. Nope. There is two kinds of group commit around that piece of code: 1) What you called batch commit, which is a configurable value (flushlogentries) that we have to append x amount of entries to trigger a sync. Clients don't hold until that syncs happens so a region server failure could lose some rows depending on the time between the last sync and the failure. If flushlogentries=100 and 99 entries are lying around for more than the timer's timeout (default 1 sec), the timer will force sync those entries. 2) Group commit happens at high concurrency and is only useful if a high number of clients are writing at the same time and that flushlogentries=1. What happens in the LogSyncer thread is that instead of calling sync() for every entry, we "group" the clients waiting on the previous sync and issue only 1 sync for all of them. In this case, when the call returns in the client, we are sure that the value is in HDFS. > > > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of stack > Sent: Tuesday, January 12, 2010 10:52 AM > To: [EMAIL PROTECTED] > Cc: Kannan Muthukkaruppan; Dhruba Borthakur > Subject: Re: commit semantics > > On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: > Hi stack, > > I was meaning "what if the application inserted the same record into two > Hbase instances"? Of course, now the onus is on the appl to keep both of > them in sync and recover from any inconsistencies between them. > > Ok. Like your "Overlapping Clusters for HA" from http://www.borthakur.com/ftp/hdfs_high_availability.pdf? > > I'm not sure how the application could return after writing one cluster without waiting on the second to complete as you suggest above. It could write in parallel but the second thread might not complete for myriad reasons. What then? And as you say, reading, the client would have to make reconciliation. > > Isn't there already a 'scalable database' that gives you this headache for free without your having to do work on your part (smile)? > > Do you think there a problem syncing on every write (with some batching of writes happening when high-concurrency) or, if that too slow for your needs, adding the holding of clients until sync happens as joydeep suggests? Will that be sufficient data integrity-wise? > > St.Ack > > Thanks, > St.Ack >
-
Re: commit semanticsAndrew Purtell 2010-01-12, 20:49
> But would say having a
> smaller number of regions per region server (say ~50) be really bad. Not at all. There are some (test) HBase deployments I know of that go pretty vertical, multiple TBs of disk on each node therefore wanting a high number of regions per region server to match that density. That may meet with operational success but it is architecturally suspect. I ran a test cluster once with > 1,000 regions per server on 25 servers, in the 0.19 timeframe. 0.20 is much better in terms of resource demand (less) and liveness (enormously improved), but I still wouldn't recommend it, unless your clients can wait for up to several minutes on blocked reads and writes to affected regions should a node go down. With that many regions per server, it stands to reason just about every client would be affected. The numbers I have for Google's canonical BigTable deployment are several years out of date but they go pretty far in the other direction -- about 100 regions per server is the target. I think it also depends on whether you intend to colocate TaskTrackers with the region servers. I presume you intend to run HBase region servers colocated with HDFS DataNodes. After you have a HBase cluster up for some number of hours, certainly ~24, background compaction will bring the HDFS blocks backing region data local to the server, generally. MapReduce tasks backed by HBase tables will see similar advantages of data locality that you are probably accustomed to with working with files in HDFS. If you mix storage and computation this way it makes sense to seek a balance between the amount of data stored on each node (number of regions being served) and the available computational resources (available CPU cores, time constraints (if any) on task execution). Even if you don't intend to do the above, it's possible that an overly high region density can negatively impact performance if too much I/O load is placed on average on each region server. Adding more servers to spread load would then likely help**. These considerations bias against hosting a very large number of regions per region server. - Andy **: I say likely because this presumes query and edit patterns have been guided as necessary through engineering to be widely distributed in the key space. You have to take some care to avoid hot regions. ----- Original Message ---- > From: Kannan Muthukkaruppan <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Sent: Tue, January 12, 2010 11:40:00 AM > Subject: RE: commit semantics > > Btw, is there much gains in having a large number of regions-- i.e. to the tune > of 500 -- per region server? > > I understand that having multiple regions per region server allows finer grained > rebalancing when new nodes are added or a node goes down. But would say having a > smaller number of regions per region server (say ~50) be really bad. If a region > server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as > 500 other nodes picking up 1/500 of its work each-- but seems acceptable still. > Are there other advantages of having a large number of regions per region > server? > > regards, > Kannan > -----Original Message----- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jean-Daniel > Cryans > Sent: Tuesday, January 12, 2010 9:42 AM > To: [EMAIL PROTECTED] > Subject: Re: commit semantics > > wrt 1 HLog per region server, this is from the Bigtable paper. Their > main concern is the number of opened files since if you have 1000 > region servers * 500 regions then you may have 100 000 HLogs to > manage. Also you can have more than one file per HLog, so let's say > you have on average 5 log files per HLog that's 500 000 files on HDFS. > > J-D > > On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur wrote: > > Hi Ryan, > > > > thanks for ur response. > > > >>Right now each regionserver has 1 log, so if 2 puts on different > >>tables hit the same RS, they hit the same HLog.
-
RE: commit semanticsKannan Muthukkaruppan 2010-01-12, 21:07
> I presume you intend to run HBase region servers
> colocated with HDFS DataNodes. Yes. --- Seems like we all generally agree that large number of regions per region server may not be the way to go. So coming back to Dhruba's question on having one commit log per region instead of one commit log per region server. Is the number of HDFS files open still a major concern? Is my understanding correct that unavailability window during region server failover is large due to the time it takes to split the shared commit log into a per region log? Instead, if we always had per-region commit logs even in the normal mode of operation, then the unavailability window would be minimized? It does minimize the extent of batch/group commits you can do though-- since you can only batch updates going to the same region. Any other gotchas/issues? regards, Kannan -----Original Message----- From: Andrew Purtell [mailto:[EMAIL PROTECTED]] Sent: Tuesday, January 12, 2010 12:50 PM To: [EMAIL PROTECTED] Subject: Re: commit semantics > But would say having a > smaller number of regions per region server (say ~50) be really bad. Not at all. There are some (test) HBase deployments I know of that go pretty vertical, multiple TBs of disk on each node therefore wanting a high number of regions per region server to match that density. That may meet with operational success but it is architecturally suspect. I ran a test cluster once with > 1,000 regions per server on 25 servers, in the 0.19 timeframe. 0.20 is much better in terms of resource demand (less) and liveness (enormously improved), but I still wouldn't recommend it, unless your clients can wait for up to several minutes on blocked reads and writes to affected regions should a node go down. With that many regions per server, it stands to reason just about every client would be affected. The numbers I have for Google's canonical BigTable deployment are several years out of date but they go pretty far in the other direction -- about 100 regions per server is the target. I think it also depends on whether you intend to colocate TaskTrackers with the region servers. I presume you intend to run HBase region servers colocated with HDFS DataNodes. After you have a HBase cluster up for some number of hours, certainly ~24, background compaction will bring the HDFS blocks backing region data local to the server, generally. MapReduce tasks backed by HBase tables will see similar advantages of data locality that you are probably accustomed to with working with files in HDFS. If you mix storage and computation this way it makes sense to seek a balance between the amount of data stored on each node (number of regions being served) and the available computational resources (available CPU cores, time constraints (if any) on task execution). Even if you don't intend to do the above, it's possible that an overly high region density can negatively impact performance if too much I/O load is placed on average on each region server. Adding more servers to spread load would then likely help**. These considerations bias against hosting a very large number of regions per region server. - Andy **: I say likely because this presumes query and edit patterns have been guided as necessary through engineering to be widely distributed in the key space. You have to take some care to avoid hot regions. ----- Original Message ---- > From: Kannan Muthukkaruppan <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Sent: Tue, January 12, 2010 11:40:00 AM > Subject: RE: commit semantics > > Btw, is there much gains in having a large number of regions-- i.e. to the tune > of 500 -- per region server? > > I understand that having multiple regions per region server allows finer grained > rebalancing when new nodes are added or a node goes down. But would say having a > smaller number of regions per region server (say ~50) be really bad. If a region > server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as
-
Re: commit semanticsJean-Daniel Cryans 2010-01-12, 21:36
Even with 100 regions, times 1000 region servers, we talk about
potentially having 100 000 opened files instead of 1000 (and also we have to count every replica). I guess that an OS that was configured for such usage would be able to sustain it... You would have to watch that metric cluster-wide, get new nodes when needed, etc. Then you need to make sure that GC pauses won't block for too long to have a very low unavailability time. J-D On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan <[EMAIL PROTECTED]> wrote: >> I presume you intend to run HBase region servers >> colocated with HDFS DataNodes. > > Yes. > > --- > > Seems like we all generally agree that large number of regions per region server may not be the way to go. > > So coming back to Dhruba's question on having one commit log per region instead of one commit log per region server. Is the number of HDFS files open still a major concern? > > Is my understanding correct that unavailability window during region server failover is large due to the time it takes to split the shared commit log into a per region log? Instead, if we always had per-region commit logs even in the normal mode of operation, then the unavailability window would be minimized? It does minimize the extent of batch/group commits you can do though-- since you can only batch updates going to the same region. Any other gotchas/issues? > > regards, > Kannan > -----Original Message----- > From: Andrew Purtell [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, January 12, 2010 12:50 PM > To: [EMAIL PROTECTED] > Subject: Re: commit semantics > >> But would say having a >> smaller number of regions per region server (say ~50) be really bad. > > Not at all. > > There are some (test) HBase deployments I know of that go pretty > vertical, multiple TBs of disk on each node therefore wanting a high > number of regions per region server to match that density. That may meet > with operational success but it is architecturally suspect. I ran a test > cluster once with > 1,000 regions per server on 25 servers, in the 0.19 > timeframe. 0.20 is much better in terms of resource demand (less) and > liveness (enormously improved), but I still wouldn't recommend it, > unless your clients can wait for up to several minutes on blocked reads > and writes to affected regions should a node go down. With that many > regions per server, it stands to reason just about every client would be > affected. > > The numbers I have for Google's canonical BigTable deployment are several > years out of date but they go pretty far in the other direction -- about > 100 regions per server is the target. > > I think it also depends on whether you intend to colocate TaskTrackers > with the region servers. I presume you intend to run HBase region servers > colocated with HDFS DataNodes. After you have a HBase cluster up for some > number of hours, certainly ~24, background compaction will bring the HDFS > blocks backing region data local to the server, generally. MapReduce > tasks backed by HBase tables will see similar advantages of data locality > that you are probably accustomed to with working with files in HDFS. If > you mix storage and computation this way it makes sense to seek a balance > between the amount of data stored on each node (number of regions being > served) and the available computational resources (available CPU cores, > time constraints (if any) on task execution). > > Even if you don't intend to do the above, it's possible that an overly > high region density can negatively impact performance if too much I/O > load is placed on average on each region server. Adding more servers to > spread load would then likely help**. > > These considerations bias against hosting a very large number of regions > per region server. > > - Andy > > **: I say likely because this presumes query and edit patterns have been > guided as necessary through engineering to be widely distributed in the > key space. You have to take some care to avoid hot regions.
-
Re: commit semanticsstack 2010-01-13, 01:23
On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
<[EMAIL PROTECTED]>wrote: > > Seems like we all generally agree that large number of regions per region > server may not be the way to go. > > What Andrew says. You could make regions bigger so more data per regionserver but same rough (small) number to redeploy on crash but the logs to replay will be correspondingly bigger taking longer to process > So coming back to Dhruba's question on having one commit log per region > instead of one commit log per region server. Is the number of HDFS files > open still a major concern? > Yes. From "Commit-log implementation" section of the BT paper: "If we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS. Depending on the underlying file system implementation on each GFS server, these writes could cause a large number of disk seeks to write to the different physical log files. In addition, having separate log files per tablet also reduces the effectiveness of the group commit optimization, since groups would tend to be smaller. To fix these issues, we append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file." Not knowing any better, we presume hdfs is kinda-like gfs. > > Is my understanding correct that unavailability window during region server > failover is large due to the time it takes to split the shared commit log > into a per region log? Yes, though truth be told, this area of hbase performance has had very little attention paid to it. There are things that we could do much better -- e.g. distributed split instead of threaded split inside in a single procss -- and ideas for making it so we can take on writes much sooner than we currently do; e.g. open regions immediately on new server before split completes. > Instead, if we always had per-region commit logs even in the normal mode of > operation, then the unavailability window would be minimized? It does > minimize the extent of batch/group commits you can do though-- since you can > only batch updates going to the same region. Any other gotchas/issues? Just those listed above. St.Ack
-
Re: commit semanticsstack 2010-01-13, 05:12
(Below is a note from Joydeep. Something about Joydeeps' messages are
requiring that I approve/disapprove them. For the message below, I his disapprove by mistake so am copying it here manually) ---------- Forwarded message ---------- From: Joydeep Sarma <[EMAIL PROTECTED]> To: [EMAIL PROTECTED], [EMAIL PROTECTED], Dhruba Borthakur < [EMAIL PROTECTED]> Date: Tue, 12 Jan 2010 15:39:05 -0800 Subject: Re: commit semantics btw - i followed up with Dhruba afterwards on this comment: > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a > pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands. i think this is likely to explain limited throughput with the default write queue threshold of 1. if the appends cannot make progress while one is waiting for the sync - then the write pipeline is going to be idle most of the time (with queue threshold of 1). i think it would be good to have the sync not block other writers on the file/pipeline. logically - it's not clear why it needs to (since the sync is just a wait for the completion as of some write transaction id - allowing new ones to be queued up subsequently). Joydeep
-
Re: commit semanticsstack 2010-01-13, 05:16
On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote:
> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a > > pending sync. "sync" in HDFS is a pretty heavyweight operation as it > stands. > > i think this is likely to explain limited throughput with the default > write queue threshold of 1. if the appends cannot make progress while > one is waiting for the sync - then the write pipeline is going to be > idle most of the time (with queue threshold of 1). > > i think it would be good to have the sync not block other writers on > the file/pipeline. logically - it's not clear why it needs to (since > the sync is just a wait for the completion as of some write > transaction id - allowing new ones to be queued up subsequently). Are you talking about internal to DFSClient Joydeep? Or some synchronization block up in hlog? St.Ack
-
Re: commit semanticsJoydeep Sarma 2010-01-13, 05:41
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1. leave it up to Dhruba to explain the details. On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: > >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it >> stands. >> >> i think this is likely to explain limited throughput with the default >> write queue threshold of 1. if the appends cannot make progress while >> one is waiting for the sync - then the write pipeline is going to be >> idle most of the time (with queue threshold of 1). >> >> i think it would be good to have the sync not block other writers on >> the file/pipeline. logically - it's not clear why it needs to (since >> the sync is just a wait for the completion as of some write >> transaction id - allowing new ones to be queued up subsequently). > > > Are you talking about internal to DFSClient Joydeep? Or some > synchronization block up in hlog? > > St.Ack >
-
Re: commit semanticsDhruba Borthakur 2010-01-13, 16:51
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks, dhruba On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > this is internal to the dfsclient. this would explain why performance > would suck with queue threshold of 1. > > leave it up to Dhruba to explain the details. > > On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: > > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: > > > >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a > >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it > >> stands. > >> > >> i think this is likely to explain limited throughput with the default > >> write queue threshold of 1. if the appends cannot make progress while > >> one is waiting for the sync - then the write pipeline is going to be > >> idle most of the time (with queue threshold of 1). > >> > >> i think it would be good to have the sync not block other writers on > >> the file/pipeline. logically - it's not clear why it needs to (since > >> the sync is just a wait for the completion as of some write > >> transaction id - allowing new ones to be queued up subsequently). > > > > > > Are you talking about internal to DFSClient Joydeep? Or some > > synchronization block up in hlog? > > > > St.Ack > > > -- Connect to me at http://www.facebook.com/dhruba
-
Re: commit semanticsJean-Daniel Cryans 2010-01-13, 17:56
That's great dhruba, I guess the sooner it could go in is 0.21.1?
J-D On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one. > > thanks, > dhruba > > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > >> this is internal to the dfsclient. this would explain why performance >> would suck with queue threshold of 1. >> >> leave it up to Dhruba to explain the details. >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: >> > >> >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it >> >> stands. >> >> >> >> i think this is likely to explain limited throughput with the default >> >> write queue threshold of 1. if the appends cannot make progress while >> >> one is waiting for the sync - then the write pipeline is going to be >> >> idle most of the time (with queue threshold of 1). >> >> >> >> i think it would be good to have the sync not block other writers on >> >> the file/pipeline. logically - it's not clear why it needs to (since >> >> the sync is just a wait for the completion as of some write >> >> transaction id - allowing new ones to be queued up subsequently). >> > >> > >> > Are you talking about internal to DFSClient Joydeep? Or some >> > synchronization block up in hlog? >> > >> > St.Ack >> > >> > > > > -- > Connect to me at http://www.facebook.com/dhruba >
-
Re: commit semanticsDhruba Borthakur 2010-01-13, 18:38
I will try to make a patch for it first. depending on the complexity of the
patch code, we can decide which release it can go in. thanks, dhruba On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > That's great dhruba, I guess the sooner it could go in is 0.21.1? > > J-D > > On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <[EMAIL PROTECTED]> > wrote: > > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one. > > > > thanks, > > dhruba > > > > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> > wrote: > > > >> this is internal to the dfsclient. this would explain why performance > >> would suck with queue threshold of 1. > >> > >> leave it up to Dhruba to explain the details. > >> > >> On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: > >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: > >> > > >> >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on > a > >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as > it > >> >> stands. > >> >> > >> >> i think this is likely to explain limited throughput with the default > >> >> write queue threshold of 1. if the appends cannot make progress while > >> >> one is waiting for the sync - then the write pipeline is going to be > >> >> idle most of the time (with queue threshold of 1). > >> >> > >> >> i think it would be good to have the sync not block other writers on > >> >> the file/pipeline. logically - it's not clear why it needs to (since > >> >> the sync is just a wait for the completion as of some write > >> >> transaction id - allowing new ones to be queued up subsequently). > >> > > >> > > >> > Are you talking about internal to DFSClient Joydeep? Or some > >> > synchronization block up in hlog? > >> > > >> > St.Ack > >> > > >> > > > > > > > > -- > > Connect to me at http://www.facebook.com/dhruba > > > -- Connect to me at http://www.facebook.com/dhruba
-
Re: commit semanticsJean-Daniel Cryans 2010-01-13, 18:40
I'll be happy to benchmark, we already have code to test the
multi-client hitting 1 region server case. J-D On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: > I will try to make a patch for it first. depending on the complexity of the > patch code, we can decide which release it can go in. > > thanks, > dhruba > > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > >> That's great dhruba, I guess the sooner it could go in is 0.21.1? >> >> J-D >> >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <[EMAIL PROTECTED]> >> wrote: >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one. >> > >> > thanks, >> > dhruba >> > >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> >> wrote: >> > >> >> this is internal to the dfsclient. this would explain why performance >> >> would suck with queue threshold of 1. >> >> >> >> leave it up to Dhruba to explain the details. >> >> >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: >> >> > >> >> >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked on >> a >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation as >> it >> >> >> stands. >> >> >> >> >> >> i think this is likely to explain limited throughput with the default >> >> >> write queue threshold of 1. if the appends cannot make progress while >> >> >> one is waiting for the sync - then the write pipeline is going to be >> >> >> idle most of the time (with queue threshold of 1). >> >> >> >> >> >> i think it would be good to have the sync not block other writers on >> >> >> the file/pipeline. logically - it's not clear why it needs to (since >> >> >> the sync is just a wait for the completion as of some write >> >> >> transaction id - allowing new ones to be queued up subsequently). >> >> > >> >> > >> >> > Are you talking about internal to DFSClient Joydeep? Or some >> >> > synchronization block up in hlog? >> >> > >> >> > St.Ack >> >> > >> >> >> > >> > >> > >> > -- >> > Connect to me at http://www.facebook.com/dhruba >> > >> > > > > -- > Connect to me at http://www.facebook.com/dhruba >
-
Re: commit semanticsDhruba Borthakur 2010-01-13, 18:43
Awesome, I will try to post a patch soon and will let you know as soon as I
have the first version ready. thanks, dhruba On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > I'll be happy to benchmark, we already have code to test the > multi-client hitting 1 region server case. > know > J-D > > On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <[EMAIL PROTECTED]> > wrote: > > I will try to make a patch for it first. depending on the complexity of > the > > patch code, we can decide which release it can go in. > > > > thanks, > > dhruba > > > > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <[EMAIL PROTECTED] > >wrote: > > > >> That's great dhruba, I guess the sooner it could go in is 0.21.1? > >> > >> J-D > >> > >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <[EMAIL PROTECTED]> > >> wrote: > >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one. > >> > > >> > thanks, > >> > dhruba > >> > > >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> > >> wrote: > >> > > >> >> this is internal to the dfsclient. this would explain why performance > >> >> would suck with queue threshold of 1. > >> >> > >> >> leave it up to Dhruba to explain the details. > >> >> > >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: > >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: > >> >> > > >> >> >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked > on > >> a > >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation > as > >> it > >> >> >> stands. > >> >> >> > >> >> >> i think this is likely to explain limited throughput with the > default > >> >> >> write queue threshold of 1. if the appends cannot make progress > while > >> >> >> one is waiting for the sync - then the write pipeline is going to > be > >> >> >> idle most of the time (with queue threshold of 1). > >> >> >> > >> >> >> i think it would be good to have the sync not block other writers > on > >> >> >> the file/pipeline. logically - it's not clear why it needs to > (since > >> >> >> the sync is just a wait for the completion as of some write > >> >> >> transaction id - allowing new ones to be queued up subsequently). > >> >> > > >> >> > > >> >> > Are you talking about internal to DFSClient Joydeep? Or some > >> >> > synchronization block up in hlog? > >> >> > > >> >> > St.Ack > >> >> > > >> >> > >> > > >> > > >> > > >> > -- > >> > Connect to me at http://www.facebook.com/dhruba > >> > > >> > > > > > > > > -- > > Connect to me at http://www.facebook.com/dhruba > > > -- Connect to me at http://www.facebook.com/dhruba
-
Re: commit semanticsJoydeep Sarma 2010-01-13, 19:01
i posted on the jira as well - but we should be able to simulate the
effect of the patch. if the sync was simulated merely a sleep (for 2-3ms - whatever is the average RTT for dfs write pipeline) instead of an actual call into dfs client - it should simulate the effect of the patch. (the appends would proceed in parallel, each sync would block for sometime). so we should be able to test whether this gets a performance win for the queue threshold=1 case. On Wed, Jan 13, 2010 at 10:43 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: > Awesome, I will try to post a patch soon and will let you know as soon as I > have the first version ready. > > thanks, > dhruba > > > On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > >> I'll be happy to benchmark, we already have code to test the >> multi-client hitting 1 region server case. >> know >> J-D >> >> On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <[EMAIL PROTECTED]> >> wrote: >> > I will try to make a patch for it first. depending on the complexity of >> the >> > patch code, we can decide which release it can go in. >> > >> > thanks, >> > dhruba >> > >> > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <[EMAIL PROTECTED] >> >wrote: >> > >> >> That's great dhruba, I guess the sooner it could go in is 0.21.1? >> >> >> >> J-D >> >> >> >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <[EMAIL PROTECTED]> >> >> wrote: >> >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one. >> >> > >> >> > thanks, >> >> > dhruba >> >> > >> >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> >> >> wrote: >> >> > >> >> >> this is internal to the dfsclient. this would explain why performance >> >> >> would suck with queue threshold of 1. >> >> >> >> >> >> leave it up to Dhruba to explain the details. >> >> >> >> >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: >> >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: >> >> >> > >> >> >> >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked >> on >> >> a >> >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation >> as >> >> it >> >> >> >> stands. >> >> >> >> >> >> >> >> i think this is likely to explain limited throughput with the >> default >> >> >> >> write queue threshold of 1. if the appends cannot make progress >> while >> >> >> >> one is waiting for the sync - then the write pipeline is going to >> be >> >> >> >> idle most of the time (with queue threshold of 1). >> >> >> >> >> >> >> >> i think it would be good to have the sync not block other writers >> on >> >> >> >> the file/pipeline. logically - it's not clear why it needs to >> (since >> >> >> >> the sync is just a wait for the completion as of some write >> >> >> >> transaction id - allowing new ones to be queued up subsequently). >> >> >> > >> >> >> > >> >> >> > Are you talking about internal to DFSClient Joydeep? Or some >> >> >> > synchronization block up in hlog? >> >> >> > >> >> >> > St.Ack >> >> >> > >> >> >> >> >> > >> >> > >> >> > >> >> > -- >> >> > Connect to me at http://www.facebook.com/dhruba >> >> > >> >> >> > >> > >> > >> > -- >> > Connect to me at http://www.facebook.com/dhruba >> > >> > > > > -- > Connect to me at http://www.facebook.com/dhruba >
-
Re: commit semanticsJean-Daniel Cryans 2010-01-13, 22:56
Good idea, let me try it.
J-D On Wed, Jan 13, 2010 at 11:01 AM, Joydeep Sarma <[EMAIL PROTECTED]> wrote: > i posted on the jira as well - but we should be able to simulate the > effect of the patch. > > if the sync was simulated merely a sleep (for 2-3ms - whatever is the > average RTT for dfs write pipeline) instead of an actual call into dfs > client - it should simulate the effect of the patch. (the appends > would proceed in parallel, each sync would block for sometime). > > so we should be able to test whether this gets a performance win for > the queue threshold=1 case. > > On Wed, Jan 13, 2010 at 10:43 AM, Dhruba Borthakur <[EMAIL PROTECTED]> wrote: >> Awesome, I will try to post a patch soon and will let you know as soon as I >> have the first version ready. >> >> thanks, >> dhruba >> >> >> On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: >> >>> I'll be happy to benchmark, we already have code to test the >>> multi-client hitting 1 region server case. >>> know >>> J-D >>> >>> On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur <[EMAIL PROTECTED]> >>> wrote: >>> > I will try to make a patch for it first. depending on the complexity of >>> the >>> > patch code, we can decide which release it can go in. >>> > >>> > thanks, >>> > dhruba >>> > >>> > On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans <[EMAIL PROTECTED] >>> >wrote: >>> > >>> >> That's great dhruba, I guess the sooner it could go in is 0.21.1? >>> >> >>> >> J-D >>> >> >>> >> On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur <[EMAIL PROTECTED]> >>> >> wrote: >>> >> > I opened http://issues.apache.org/jira/browse/HDFS-895 for this one. >>> >> > >>> >> > thanks, >>> >> > dhruba >>> >> > >>> >> > On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma <[EMAIL PROTECTED]> >>> >> wrote: >>> >> > >>> >> >> this is internal to the dfsclient. this would explain why performance >>> >> >> would suck with queue threshold of 1. >>> >> >> >>> >> >> leave it up to Dhruba to explain the details. >>> >> >> >>> >> >> On Tue, Jan 12, 2010 at 9:16 PM, stack <[EMAIL PROTECTED]> wrote: >>> >> >> > On Tue, Jan 12, 2010 at 9:12 PM, stack <[EMAIL PROTECTED]> wrote: >>> >> >> > >>> >> >> >> > any IO to a HDFS-file (appends, writes, etc) ae actually blocked >>> on >>> >> a >>> >> >> >> > pending sync. "sync" in HDFS is a pretty heavyweight operation >>> as >>> >> it >>> >> >> >> stands. >>> >> >> >> >>> >> >> >> i think this is likely to explain limited throughput with the >>> default >>> >> >> >> write queue threshold of 1. if the appends cannot make progress >>> while >>> >> >> >> one is waiting for the sync - then the write pipeline is going to >>> be >>> >> >> >> idle most of the time (with queue threshold of 1). >>> >> >> >> >>> >> >> >> i think it would be good to have the sync not block other writers >>> on >>> >> >> >> the file/pipeline. logically - it's not clear why it needs to >>> (since >>> >> >> >> the sync is just a wait for the completion as of some write >>> >> >> >> transaction id - allowing new ones to be queued up subsequently). >>> >> >> > >>> >> >> > >>> >> >> > Are you talking about internal to DFSClient Joydeep? Or some >>> >> >> > synchronization block up in hlog? >>> >> >> > >>> >> >> > St.Ack >>> >> >> > >>> >> >> >>> >> > >>> >> > >>> >> > >>> >> > -- >>> >> > Connect to me at http://www.facebook.com/dhruba >>> >> > >>> >> >>> > >>> > >>> > >>> > -- >>> > Connect to me at http://www.facebook.com/dhruba >>> > >>> >> >> >> >> -- >> Connect to me at http://www.facebook.com/dhruba >> > |