|
Matt Corgan
2011-06-02, 22:28
Todd Lipcon
2011-06-02, 22:37
Jason Rutherglen
2011-06-02, 23:25
Jonathan Gray
2011-06-02, 23:44
Matt Corgan
2011-06-03, 03:17
Jason Rutherglen
2011-06-03, 03:28
Stack
2011-06-03, 04:48
Stack
2011-06-03, 04:51
Todd Lipcon
2011-06-03, 04:56
Matt Corgan
2011-06-04, 02:03
Jason Rutherglen
2011-06-04, 03:42
Jason Rutherglen
2011-06-04, 03:51
Matt Corgan
2011-06-04, 03:56
Jason Rutherglen
2011-06-04, 04:01
Matt Corgan
2011-06-04, 04:57
Stack
2011-06-04, 05:57
Jason Rutherglen
2011-06-04, 07:18
Jason Rutherglen
2011-06-04, 20:10
Stack
2011-06-04, 20:58
|
-
prefix compressionMatt Corgan 2011-06-02, 22:28
I've been reading up on prefix compression for a few weeks trying to figure
out the best approach. Maybe some of you have noticed that 90% of my emails to the user list have something to do with it... Now that people are working on HFile v2 and the topic of Lucene's FST has come up, I thought it would be a good time to share some thoughts. My goal here is definitely not asking anyone to implement this, but just to start a discussion that gets the proper jiras in place. I have spent a little time coding an initial rowKey trie builder that could be the start of a pluggable implementation. This is a big topic to address in one email, but I don't think the pieces make sense on their own. Here's an outline with some thoughts: * refer to prefix compression as "compaction" to avoid interfering with traditional "compression" * main goal is to reduce size of data in memory to increase effective capacity of block index, block cache, and memstores * disk access takes thousands of times longer than memory access, especially over hdfs/network * secondary goal is to speed up the processing of in memory data * memory access takes 200+ cpu cycles, so must use cache friendly algorithms * structure trie node sizes around typical 64B cache line size * investigate java's memory prefetching capabilities, possibly increasing to ~256B nodes * research paper<http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg>about modern memory databases and prefetching * almost always use variable length integer/long encoding * need separate compacted structures/algorithms for rowKey, colQualifier, and timestamp * structures will need to interact with varInt pointers to link rowKey<->colQualifier<->timestamp * i think in the medium to long term (and maybe even short term) it would be best to code this from scratch Static row key compaction for block cache and data blocks: * avoid schemes where each row is a diff of the previous (expensive cpu) * utilize the fact that row keys are always sorted (unlike column qualifiers) * this makes for fast tree construction during flushing and compaction * low hanging fruit: (rowKeyCompaction=simple) * pull off the common prefix of all rows * have a flag indicating if row key is repeat of the previous * more aggressive: use modified Cache Sensitive Search Trees<http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf> (rowKeyCompaction=cssTree64, cssTree256, etc) * modified to accept full byte range 0-255, store full byte strings in the tree, remove all pointers, optimize for cache line size, etc, but similar idea * CSS trees are considered sub-par because they are not easily modified, but that is ok in hbase's block cache and data blocks * input is a sorted list of byte[], output is a byte[] containing the entire prefix trie * possibly put all keys in the front of the block, and all values at the end, with value offsets/lengths in the keys Backwards compatibility * block cache could quickly compute the trie when loaded, therefore working with HFile v1 * eventually want to store the computed trie on disk * HFile v2 will need to store information about the pluggable data block compaction schemes Memstore dynamic row key compaction: * this will be much more difficult than the static compaction because it needs to accept unsorted byte[]'s and support row locking * use modified C-Burstsort<http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf> for dynamic unsorted data * fast and compact (best dynamic string sorting algorithm i could find) * fragmentation friendly (not original goal, but ends up working like MSLAB) * allocates large (customizable) byte[] for holding multiple leaf values * "bursts" the byte[] when they fill up, creating multiple child arrays * will need to add some sort of lockable object to the leaf of the rowKey trie Column qualifier compaction: * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc) * use lookupTable when column qualifiers are a small bounded set * store the lookup table in memory (similar to blockIndex or fixedFileTrailer) * on second thought, this could be difficult without 2-pass compaction, but maybe still possible * use cssTree when a large or unbounded set, like when appending a numerical suffix in wide rows Timestamp compaction: * timestampCompaction=(none, hfileDiff, cssTree, flatten) * each hfile needs a lowestTimestamp stored in the trailer * hflieDiff would store a varLong relative to the lowestTimestamp * cssTree would do the same as it does for rowKeys, but the trie overhead might wipe out the space savings with such small values * flatten would not store a timestamp in each record, rather use the hfile's lowestTimestamp for every record * many use cases don't care about the timestamp * i think this works for maxVersions=1. may need more elaborate scheme for multiple versions I'll stop there... sorry for the huge email! I haven't seen much discussion on email or jira about how prefix compression should be done, so I hope this provides somewhat of a starting point. Matt
-
Re: prefix compressionTodd Lipcon 2011-06-02, 22:37
Hey Matt,
Interesting email, and also something I've been thinking about recently. Unfortunately, I think one of the big prerequisites before we can start thinking about actual compression algorithms is some refactoring around how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of places - these byte arrays are expected to have discrete sub-ranges that represent row key, column qualifier, etc. Various parts of the code directly grab these byte arrays and offsets and do stuff with them (eg comparison, serialization, etc). In order to split the different components up, or do true prefix compression, we'd need to be able to have KeyValue do some smarter things for comparison, and never call functions like "getBuffer), getRowKeyOffset(), getRowKeyLength()" expecting that to show us a byte array range with a row key in it. I think Jonathan Gray has some code hacked together for this, but last I heard it wasn't in shippable state... Jonathan, you out there? As for the particular compression algorithms, a nice first step would be a very simplistic one: the first row key of any HFile block would be given in full. Following that, each key uses a single byte token which identifies the number of key components shared with the previous row (ie same row, same col, different ts). This isn't quite as good as true prefix compression, but for common cases where each cell has three versions, and each row has many columns, it's probably a huge savings. -Todd On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > I've been reading up on prefix compression for a few weeks trying to figure > out the best approach. Maybe some of you have noticed that 90% of my > emails > to the user list have something to do with it... Now that people are > working on HFile v2 and the topic of Lucene's FST has come up, I thought it > would be a good time to share some thoughts. My goal here is definitely > not > asking anyone to implement this, but just to start a discussion that gets > the proper jiras in place. I have spent a little time coding an initial > rowKey trie builder that could be the start of a pluggable implementation. > > This is a big topic to address in one email, but I don't think the pieces > make sense on their own. Here's an outline with some thoughts: > > * refer to prefix compression as "compaction" to avoid interfering with > traditional "compression" > > * main goal is to reduce size of data in memory to increase effective > capacity of block index, block cache, and memstores > * disk access takes thousands of times longer than memory access, > especially over hdfs/network > > * secondary goal is to speed up the processing of in memory data > * memory access takes 200+ cpu cycles, so must use cache friendly > algorithms > * structure trie node sizes around typical 64B cache line size > * investigate java's memory prefetching capabilities, possibly > increasing to ~256B nodes > * research > paper< > http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg > >about > modern memory databases and prefetching > * almost always use variable length integer/long encoding > > * need separate compacted structures/algorithms for rowKey, colQualifier, > and timestamp > * structures will need to interact with varInt pointers to link > rowKey<->colQualifier<->timestamp > > * i think in the medium to long term (and maybe even short term) it would > be > best to code this from scratch > > > Static row key compaction for block cache and data blocks: > * avoid schemes where each row is a diff of the previous (expensive cpu) > * utilize the fact that row keys are always sorted (unlike column > qualifiers) > * this makes for fast tree construction during flushing and compaction Todd Lipcon Software Engineer, Cloudera
-
Re: prefix compressionJason Rutherglen 2011-06-02, 23:25
> Memstore dynamic row key compaction
This is interesting though I think extremely difficult. We need this for Lucene realtime search with the RAM terms dictionary, which is slated to use a ConcurrentSkipListMap, eg it's lock free and it works. As Todd mentions, the KeyValue's need for access to the underlying byte[] does create some difficulties in regards to how rowids'd work. I think we could start with making the block index more efficient. And prefix compression would be a big win. Is there a Jira for that? Sounds like we need to make the HFile pluggable first regardless? On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > I've been reading up on prefix compression for a few weeks trying to figure > out the best approach. Maybe some of you have noticed that 90% of my emails > to the user list have something to do with it... Now that people are > working on HFile v2 and the topic of Lucene's FST has come up, I thought it > would be a good time to share some thoughts. My goal here is definitely not > asking anyone to implement this, but just to start a discussion that gets > the proper jiras in place. I have spent a little time coding an initial > rowKey trie builder that could be the start of a pluggable implementation. > > This is a big topic to address in one email, but I don't think the pieces > make sense on their own. Here's an outline with some thoughts: > > * refer to prefix compression as "compaction" to avoid interfering with > traditional "compression" > > * main goal is to reduce size of data in memory to increase effective > capacity of block index, block cache, and memstores > * disk access takes thousands of times longer than memory access, > especially over hdfs/network > > * secondary goal is to speed up the processing of in memory data > * memory access takes 200+ cpu cycles, so must use cache friendly > algorithms > * structure trie node sizes around typical 64B cache line size > * investigate java's memory prefetching capabilities, possibly > increasing to ~256B nodes > * research > paper<http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg>about > modern memory databases and prefetching > * almost always use variable length integer/long encoding > > * need separate compacted structures/algorithms for rowKey, colQualifier, > and timestamp > * structures will need to interact with varInt pointers to link > rowKey<->colQualifier<->timestamp > > * i think in the medium to long term (and maybe even short term) it would be > best to code this from scratch > > > Static row key compaction for block cache and data blocks: > * avoid schemes where each row is a diff of the previous (expensive cpu) > * utilize the fact that row keys are always sorted (unlike column > qualifiers) > * this makes for fast tree construction during flushing and compaction > * low hanging fruit: (rowKeyCompaction=simple) > * pull off the common prefix of all rows > * have a flag indicating if row key is repeat of the previous > * more aggressive: use modified Cache Sensitive Search > Trees<http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf> > (rowKeyCompaction=cssTree64, > cssTree256, etc) > * modified to accept full byte range 0-255, store full byte strings in > the tree, remove all pointers, optimize for cache line size, etc, but > similar idea > * CSS trees are considered sub-par because they are not easily modified, > but that is ok in hbase's block cache and data blocks > * input is a sorted list of byte[], output is a byte[] containing the entire > prefix trie > * possibly put all keys in the front of the block, and all values at the > end, with value offsets/lengths in the keys >
-
RE: prefix compressionJonathan Gray 2011-06-02, 23:44
I'm here!
Still parsing through the stuff in this e-mail but I agree that many approaches will require rejiggering of KeyValue in a significant way. I have made some attempts at this but nothing is very far. It's definitely something that we are hoping to put some resources into over the summer. JG > -----Original Message----- > From: Todd Lipcon [mailto:[EMAIL PROTECTED]] > Sent: Thursday, June 02, 2011 3:37 PM > To: [EMAIL PROTECTED] > Subject: Re: prefix compression > > Hey Matt, > > Interesting email, and also something I've been thinking about recently. > > Unfortunately, I think one of the big prerequisites before we can start > thinking about actual compression algorithms is some refactoring around > how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of > places > - these byte arrays are expected to have discrete sub-ranges that represent > row key, column qualifier, etc. Various parts of the code directly grab these > byte arrays and offsets and do stuff with them (eg comparison, serialization, > etc). > > In order to split the different components up, or do true prefix compression, > we'd need to be able to have KeyValue do some smarter things for > comparison, and never call functions like "getBuffer), getRowKeyOffset(), > getRowKeyLength()" expecting that to show us a byte array range with a row > key in it. > > I think Jonathan Gray has some code hacked together for this, but last I > heard it wasn't in shippable state... Jonathan, you out there? > > As for the particular compression algorithms, a nice first step would be a very > simplistic one: the first row key of any HFile block would be given in full. > Following that, each key uses a single byte token which identifies the > number of key components shared with the previous row (ie same row, > same col, different ts). This isn't quite as good as true prefix compression, > but for common cases where each cell has three versions, and each row has > many columns, it's probably a huge savings. > > -Todd > > On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> > wrote: > > > I've been reading up on prefix compression for a few weeks trying to > > figure out the best approach. Maybe some of you have noticed that 90% > > of my emails to the user list have something to do with it... Now > > that people are working on HFile v2 and the topic of Lucene's FST has > > come up, I thought it would be a good time to share some thoughts. My > > goal here is definitely not asking anyone to implement this, but just > > to start a discussion that gets the proper jiras in place. I have > > spent a little time coding an initial rowKey trie builder that could > > be the start of a pluggable implementation. > > > > This is a big topic to address in one email, but I don't think the > > pieces make sense on their own. Here's an outline with some thoughts: > > > > * refer to prefix compression as "compaction" to avoid interfering > > with traditional "compression" > > > > * main goal is to reduce size of data in memory to increase effective > > capacity of block index, block cache, and memstores > > * disk access takes thousands of times longer than memory access, > > especially over hdfs/network > > > > * secondary goal is to speed up the processing of in memory data > > * memory access takes 200+ cpu cycles, so must use cache friendly > > algorithms > > * structure trie node sizes around typical 64B cache line size > > * investigate java's memory prefetching capabilities, possibly > > increasing to ~256B nodes > > * research > > paper< > > > http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFj > AE&url > > > =http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3 > D10.1. > > > 1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetchi > ng&ei > > =7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1- > UG0x6ZsqsIHLqvGDA&sig2=o0 > > 96l8cNrnX69oeBGChUKg > > >about > > modern memory databases and prefetching
-
Re: prefix compressionMatt Corgan 2011-06-03, 03:17
What about turning KeyValue into an interface with only the essential
getXX() methods? Whatever is using the KeyValue shouldn't *have* to know what sort of structure it's backed by, but could figure out the implementation in cases where it wants to optimize performance. Seems like this is warranted right now just for the sake of breaking up such a huge class... Possible implementations include: new SimpleKeyValue(final byte[] row, final byte[] family, final byte[] qualifier, final long timestamp, Type type, final byte[] value) (separate fields, low performance, useful for debugging?) new ArrayKeyValue(byte[] bytes) new SubArrayKeyValue(byte[] bytes, int offset, int length) new BlockIndexKeyValue(BlockIndex blockIndex, int kvPosition) new HFileKeyValue(HFileScanner scanner) etc... where each implementation knows how to fish out the underlying data. Later we add more implementations: new CssTreeBlockIndexKeyValue(CssTreeBlockIndex blockIndex, int kvPosition) @Jason - i agree about starting with the block index since it's a relatively isolated piece of code. Could possibly be done without KeyValue modifications. Also, i think it could work without changing the HFile if people are willing to let it compact the index when the HFile is read rather than when the when it's written. Would make for slightly slower startup times but would save a lot of memory. Could be configurable. On Thu, Jun 2, 2011 at 4:44 PM, Jonathan Gray <[EMAIL PROTECTED]> wrote: > I'm here! > > Still parsing through the stuff in this e-mail but I agree that many > approaches will require rejiggering of KeyValue in a significant way. I > have made some attempts at this but nothing is very far. It's definitely > something that we are hoping to put some resources into over the summer. > > JG > > > -----Original Message----- > > From: Todd Lipcon [mailto:[EMAIL PROTECTED]] > > Sent: Thursday, June 02, 2011 3:37 PM > > To: [EMAIL PROTECTED] > > Subject: Re: prefix compression > > > > Hey Matt, > > > > Interesting email, and also something I've been thinking about recently. > > > > Unfortunately, I think one of the big prerequisites before we can start > > thinking about actual compression algorithms is some refactoring around > > how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of > > places > > - these byte arrays are expected to have discrete sub-ranges that > represent > > row key, column qualifier, etc. Various parts of the code directly grab > these > > byte arrays and offsets and do stuff with them (eg comparison, > serialization, > > etc). > > > > In order to split the different components up, or do true prefix > compression, > > we'd need to be able to have KeyValue do some smarter things for > > comparison, and never call functions like "getBuffer), getRowKeyOffset(), > > getRowKeyLength()" expecting that to show us a byte array range with a > row > > key in it. > > > > I think Jonathan Gray has some code hacked together for this, but last I > > heard it wasn't in shippable state... Jonathan, you out there? > > > > As for the particular compression algorithms, a nice first step would be > a very > > simplistic one: the first row key of any HFile block would be given in > full. > > Following that, each key uses a single byte token which identifies the > > number of key components shared with the previous row (ie same row, > > same col, different ts). This isn't quite as good as true prefix > compression, > > but for common cases where each cell has three versions, and each row has > > many columns, it's probably a huge savings. > > > > -Todd > > > > On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> > > wrote: > > > > > I've been reading up on prefix compression for a few weeks trying to > > > figure out the best approach. Maybe some of you have noticed that 90% > > > of my emails to the user list have something to do with it... Now > > > that people are working on HFile v2 and the topic of Lucene's FST has
-
Re: prefix compressionJason Rutherglen 2011-06-03, 03:28
> i agree about starting with the block index since it's a relatively
> isolated piece of code. Could possibly be done without KeyValue > modifications Right. I don't think any KeyValue changes will be required. I'm trying to get more info about if we can MMap the FST, then store all the keys into it on disk/HDFS. This could completely and insanely reduce the cost of rowids both in terms of loading them into heap, heap usage, and the overall size of HFiles in general. Also it would enable the user to choose keys of a much larger size and not pay a penalty. On Thu, Jun 2, 2011 at 8:17 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > What about turning KeyValue into an interface with only the essential > getXX() methods? Whatever is using the KeyValue shouldn't *have* to know > what sort of structure it's backed by, but could figure out the > implementation in cases where it wants to optimize performance. Seems like > this is warranted right now just for the sake of breaking up such a huge > class... Possible implementations include: > > new SimpleKeyValue(final byte[] row, final byte[] family, > final byte[] qualifier, final long timestamp, Type type, > final byte[] value) (separate fields, low performance, useful for > debugging?) > > new ArrayKeyValue(byte[] bytes) > > new SubArrayKeyValue(byte[] bytes, int offset, int length) > > new BlockIndexKeyValue(BlockIndex blockIndex, int kvPosition) > > new HFileKeyValue(HFileScanner scanner) > > etc... where each implementation knows how to fish out the underlying data. > Later we add more implementations: > > new CssTreeBlockIndexKeyValue(CssTreeBlockIndex blockIndex, int kvPosition) > > > @Jason - i agree about starting with the block index since it's a relatively > isolated piece of code. Could possibly be done without KeyValue > modifications. Also, i think it could work without changing the HFile if > people are willing to let it compact the index when the HFile is read rather > than when the when it's written. Would make for slightly slower startup > times but would save a lot of memory. Could be configurable. > > > On Thu, Jun 2, 2011 at 4:44 PM, Jonathan Gray <[EMAIL PROTECTED]> wrote: > >> I'm here! >> >> Still parsing through the stuff in this e-mail but I agree that many >> approaches will require rejiggering of KeyValue in a significant way. I >> have made some attempts at this but nothing is very far. It's definitely >> something that we are hoping to put some resources into over the summer. >> >> JG >> >> > -----Original Message----- >> > From: Todd Lipcon [mailto:[EMAIL PROTECTED]] >> > Sent: Thursday, June 02, 2011 3:37 PM >> > To: [EMAIL PROTECTED] >> > Subject: Re: prefix compression >> > >> > Hey Matt, >> > >> > Interesting email, and also something I've been thinking about recently. >> > >> > Unfortunately, I think one of the big prerequisites before we can start >> > thinking about actual compression algorithms is some refactoring around >> > how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of >> > places >> > - these byte arrays are expected to have discrete sub-ranges that >> represent >> > row key, column qualifier, etc. Various parts of the code directly grab >> these >> > byte arrays and offsets and do stuff with them (eg comparison, >> serialization, >> > etc). >> > >> > In order to split the different components up, or do true prefix >> compression, >> > we'd need to be able to have KeyValue do some smarter things for >> > comparison, and never call functions like "getBuffer), getRowKeyOffset(), >> > getRowKeyLength()" expecting that to show us a byte array range with a >> row >> > key in it. >> > >> > I think Jonathan Gray has some code hacked together for this, but last I >> > heard it wasn't in shippable state... Jonathan, you out there? >> > >> > As for the particular compression algorithms, a nice first step would be >> a very >> > simplistic one: the first row key of any HFile block would be given in >> full.
-
Re: prefix compressionStack 2011-06-03, 04:48
High-level this sounds like a great.
Inline below is some feedback and a bit of history on how we got here in case it helps: On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > * refer to prefix compression as "compaction" to avoid interfering with > traditional "compression" > 'Compaction' is already a term in hbase. Your suggestion below may reproduce an issue we currently have where the term 'split' can refer to the splitting of logs on regionserver recovery or spit of a region because its grown too large. Just a headsup. > * main goal is to reduce size of data in memory to increase effective > capacity of block index, block cache, and memstores Ok. At some cost in CPU. > * disk access takes thousands of times longer than memory access, > especially over hdfs/network > Yes. I would underline the clause that starts 'especially' in the above. > * secondary goal is to speed up the processing of in memory data > * memory access takes 200+ cpu cycles, so must use cache friendly > algorithms > * structure trie node sizes around typical 64B cache line size How would you measure our effectiveness here? > * investigate java's memory prefetching capabilities, possibly > increasing to ~256B nodes Do you have a pointer on this phenomeon (I'm ignorant and would learn more)? > * almost always use variable length integer/long encoding > In my limited experience I've found these to be CPU intensive making the parse of a compound byte array of rowkey+family+ etc. such as KV is now complicated. On other hand, our Benoit reports having made significant improvements over the hadoop varlength I was using. > * need separate compacted structures/algorithms for rowKey, colQualifier, > and timestamp Why you think that Matt? Wouldn't rowKey and say colQualifier share an algorithm because both are arbitrary byte arrays? Or, are you talking about context; the fact that the previous KV may have same row and family say? (We've talked of using a code for column family keeping a table of codes to family name...). > * structures will need to interact with varInt pointers to link > rowKey<->colQualifier<->timestamp > > * i think in the medium to long term (and maybe even short term) it would be > best to code this from scratch > I'd suggest some code sketches apart from hbase would be cool illustrating what you are thinking. KV is pervasive; from client API all the ways out to entries in HFile. Changing it is going to be a PITA. > > Static row key compaction for block cache and data blocks: > * avoid schemes where each row is a diff of the previous (expensive cpu) Yes. I think this is likely the case. > * utilize the fact that row keys are always sorted (unlike column > qualifiers) Column qualifiers are also sorted. > * this makes for fast tree construction during flushing and compaction Because you are filling the tree w/ sorted data? > * low hanging fruit: (rowKeyCompaction=simple) > * pull off the common prefix of all rows In a block context? As the context's change, would we have a different means of doing this (Examples of contexts: memstore, block cache, hfile). > * modified to accept full byte range 0-255, store full byte strings in > the tree, remove all pointers, optimize for cache line size, etc, but > similar idea How well can this be done in a language such as java? Does its indirection make it so its not possible to be cache line size consious? > * CSS trees are considered sub-par because they are not easily modified, > but that is ok in hbase's block cache and data blocks And for the memstore? Do you have an idea for what kind of structure we could use here, one that exploits 'compressed' values? > * input is a sorted list of byte[], output is a byte[] containing the entire > prefix trie > * possibly put all keys in the front of the block, and all values at the > end, with value offsets/lengths in the keys > How do you think this would improve things? I don't think we need to be backward compatible. We just need to be able to migrate from old to the new (in fact, I'd say do not be backward compatible -- having to do so in my experience makes the code more complex and inevitable compromises an implementation) Yes. As has been said v2 has a hierarchical index. That might be hard to make pluggable? What do you think? Row 'locking' is done using MVCC -- a sequence/edit number is added to KVs in-memory only and only edits with a sequence/edit number older than a current moving point are viewable. This mechanism or one like it could be used going forward/ Do we need too? (Thanks for the paper pointers) Yes. Should we be doing 64bit timestamps? Then it might not? Thanks for the provocative mail Matt. We arrived at KV because previously we had an Object of key, value and timestamp which we were copying multiple times on the way in and then again on the way. KV was an attempt at cutting down on the copies and object creation. At the time we thought byte arrays raw and fast. We looked at varint'ing it but savings didn't seem worth the bother or the extra CPU and it make parse of the KV byte array harder. St.Ack
-
Re: prefix compressionStack 2011-06-03, 04:51
On Thu, Jun 2, 2011 at 8:17 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> What about turning KeyValue into an interface with only the essential > getXX() methods? Whatever is using the KeyValue shouldn't *have* to know > what sort of structure it's backed by, but could figure out the > implementation in cases where it wants to optimize performance. A few of us have looked at doing this and baulked. Maybe if we'd had a stiff drink first.... ... > new CssTreeBlockIndexKeyValue(CssTreeBlockIndex blockIndex, int kvPosition) > Sounds great. St.Ack
-
Re: prefix compressionTodd Lipcon 2011-06-03, 04:56
On Thu, Jun 2, 2011 at 9:51 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Thu, Jun 2, 2011 at 8:17 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > What about turning KeyValue into an interface with only the essential > > getXX() methods? Whatever is using the KeyValue shouldn't *have* to know > > what sort of structure it's backed by, but could figure out the > > implementation in cases where it wants to optimize performance. > > A few of us have looked at doing this and baulked. Maybe if we'd had > a stiff drink first.... > I gave it a go couple months ago on a lazy Saturday. Spent about 2 hours, came out with a huge mess and an HBase that didn't compile. Best of luck to you if you choose to fight this dragon :) We _could_ attempt to do it somewhat incrementally - eg first deprecate "getBuffer" and then slowly remove all the calls. I imagine 95% of the calls fall into the category of a) comparison or b) serialization onto wire/file. We can replace A with calls into KVComparator, and B with calls to a new "write(Column|TS|RowKey|...)" call. If we can get that down to 0, then "getBuffer" becomes private only accessible to KVComparator, and we move onto the next call that eludes interfacedom. ... > > new CssTreeBlockIndexKeyValue(CssTreeBlockIndex blockIndex, int > kvPosition) > > > > Sounds great. > > St.Ack > -- Todd Lipcon Software Engineer, Cloudera
-
Re: prefix compressionMatt Corgan 2011-06-04, 02:03
Thanks for the feedback Stack. Some inline responses:
On Thu, Jun 2, 2011 at 9:48 PM, Stack <[EMAIL PROTECTED]> wrote: > High-level this sounds like a great. > > Inline below is some feedback and a bit of history on how we got here > in case it helps: > > On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > * refer to prefix compression as "compaction" to avoid interfering with > > traditional "compression" > > > > 'Compaction' is already a term in hbase. Your suggestion below may > reproduce an issue we currently have where the term 'split' can refer > to the splitting of logs on regionserver recovery or spit of a region > because its grown too large. Just a headsup. > > ha, yes, not sure how i overlooked that. Then maybe name the config params: rowKeyFormat, colQualifierFormat, and timestampFormat > > > * main goal is to reduce size of data in memory to increase effective > > capacity of block index, block cache, and memstores > > > Ok. At some cost in CPU. > > > > * disk access takes thousands of times longer than memory access, > > especially over hdfs/network > > > > Yes. I would underline the clause that starts 'especially' in the above. > > > > * secondary goal is to speed up the processing of in memory data > > * memory access takes 200+ cpu cycles, so must use cache friendly > > algorithms > > * structure trie node sizes around typical 64B cache line size > > How would you measure our effectiveness here? > The most important thing is to make the format pluggable, then we can run HFile performance tests to find the optimal tuning for different hardware/workloads. The idea with the 64B is that if you're going to go to memory and get 64B (or 8 words) back, you might as well have designed your data structure so that you can make use of all 64B. This requires that you structure the prefix tries a little differently than older literature recommends. > > > > * investigate java's memory prefetching capabilities, possibly > > increasing to ~256B nodes > > Do you have a pointer on this phenomeon (I'm ignorant and would learn > more)? > > Paying attention to pre-fetching would be a super-optimization, but it is a very real thing that can buy you another ~20% performance. 64B is the normal memory fetch size right now, but sometimes the hardware will bring back even larger chunks. Some languages offer specific commands to prefetch... java doesn't, but the compiler is adding them behind the scenes so it could possibly be coerced. The best reading on the topic is probably the paper i mentioned earlier (link<http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg>), and then its references if you're still interested. > > > * almost always use variable length integer/long encoding > > > > In my limited experience I've found these to be CPU intensive making > the parse of a compound byte array of rowkey+family+ etc. such as KV > is now complicated. > > On other hand, our Benoit reports having made significant improvements > over the hadoop varlength I was using. > > Pluggable formats would help here so you could tune for mem vs cpu. My thought is that most hbase vints would be very short (1-2 byte internal array offsets) which are faster to decode, but more importantly, save a lot of memory. This topic is covered a lot in discussions of inverted index posting list compression. Not perfectly relevant, but here's a good paper<http://www2008.org/papers/pdf/p387-zhangA.pdf>about compressing integers/longs. I have no data to back it up, but I feel like if search engines are primarily storing lists of longs and they decide it's best to compress them, then that's probably the best method for other pieces of software as well. > > * need separate compacted structures/algorithms for rowKey, colQualifier, All three parts of the KV could share the same trie algorithm for compressing arbitrary byte arrays. However, I think sometimes you can compress colQualifiers even further (and faster) with a lookup table. And timestamps even further with a delta scheme or by completely omitting them. Agreed - i'm working on some simple code examples in my free time. Hopefully can post in a couple weeks. data block. If you take a data block and split it into 3 parallel lists of equal length: a list of rowKeys, a list of colQualifiers, and a list of timestamps, the rowKeys will be automatically be sorted, but the other two will not be (unless the whole block is contained in one row). Does that make sense? For memstore and block index, you take the common prefix of the first and last rowKey. For data block, take the common prefix of first and last key. As tables get really large, they are more and more likely to have longer common prefixes. I think it's the hardware that determines the cache line size and does some of the prefetching, so java is directly affected by those. Unfortunately, I'm not sure it's possible to align the start of an array on a cache line in java like it is in C/C++, however, on average you can make use of half the cache line... and there's always hope for the future. Not sure it's always best to do this, but the idea is to not pollute the cache with value data unless it's actually needed. Think of the case of a scanner with a high block cache hit ratio but returning sparse results. This may be an improvement on the case where prefix compression isn't needed to begin with. Another case for pluggable formats and testing... Agree about backwards compatibility being a bad idea (especially in a pre-1.0 product). I guess I meant "current compatibility", like it could be included in 0.90.4 since it doesn't require changes in serialization or file formats. Not pushing for that,
-
Re: prefix compressionJason Rutherglen 2011-06-04, 03:42
Here's a nice preliminary number with the FST, 50 million dates of the
form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The FST is 984 bytes, with an incrementing long to point to the presumably MMap'd value data. This's a bit crazy. Perhaps we should try other increments as well? Given that HBase keys especially are probably close increments of each other, I think the FST can always be loaded into RAM with pointers out to the actual values.
-
Re: prefix compressionJason Rutherglen 2011-06-04, 03:51
Also the next thing to measure with the FST is the key lookup speed.
I'm not sure what that'd look like, or how to compare with HBase right now? On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > Here's a nice preliminary number with the FST, 50 million dates of the > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The > FST is 984 bytes, with an incrementing long to point to the presumably > MMap'd value data. This's a bit crazy. > > Perhaps we should try other increments as well? Given that HBase keys > especially are probably close increments of each other, I think the > FST can always be loaded into RAM with pointers out to the actual > values. >
-
Re: prefix compressionMatt Corgan 2011-06-04, 03:56
Jason - are you feeding it that whole string for each date? Input data is
17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? Is it possible to compress by that much? Maybe I'm missing something about how the FST works. Matt On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <[EMAIL PROTECTED] > wrote: > Also the next thing to measure with the FST is the key lookup speed. > I'm not sure what that'd look like, or how to compare with HBase right > now? > > On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen > <[EMAIL PROTECTED]> wrote: > > Here's a nice preliminary number with the FST, 50 million dates of the > > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The > > FST is 984 bytes, with an incrementing long to point to the presumably > > MMap'd value data. This's a bit crazy. > > > > Perhaps we should try other increments as well? Given that HBase keys > > especially are probably close increments of each other, I think the > > FST can always be loaded into RAM with pointers out to the actual > > values. > > >
-
Re: prefix compressionJason Rutherglen 2011-06-04, 04:01
Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz
You can see the line that's adding the string: fstBuilder.add(new BytesRef(date), new Long(x)); On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > Jason - are you feeding it that whole string for each date? Input data is > 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? > Is it possible to compress by that much? Maybe I'm missing something about > how the FST works. > > Matt > > > On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <[EMAIL PROTECTED] >> wrote: > >> Also the next thing to measure with the FST is the key lookup speed. >> I'm not sure what that'd look like, or how to compare with HBase right >> now? >> >> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen >> <[EMAIL PROTECTED]> wrote: >> > Here's a nice preliminary number with the FST, 50 million dates of the >> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The >> > FST is 984 bytes, with an incrementing long to point to the presumably >> > MMap'd value data. This's a bit crazy. >> > >> > Perhaps we should try other increments as well? Given that HBase keys >> > especially are probably close increments of each other, I think the >> > FST can always be loaded into RAM with pointers out to the actual >> > values. >> > >> >
-
Re: prefix compressionMatt Corgan 2011-06-04, 04:57
Ah - I see. It's generating multiple duplicate timestamps per millisecond,
so there are fewer than 50mm unique strings. Duplicates just require incrementing a counter. Agree it's very cool though! sent from my phone On Jun 3, 2011 9:02 PM, "Jason Rutherglen" <[EMAIL PROTECTED]> wrote: > Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz > > You can see the line that's adding the string: > > fstBuilder.add(new BytesRef(date), new Long(x)); > > On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >> Jason - are you feeding it that whole string for each date? Input data is >> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? >> Is it possible to compress by that much? Maybe I'm missing something about >> how the FST works. >> >> Matt >> >> >> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen < [EMAIL PROTECTED] >>> wrote: >> >>> Also the next thing to measure with the FST is the key lookup speed. >>> I'm not sure what that'd look like, or how to compare with HBase right >>> now? >>> >>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen >>> <[EMAIL PROTECTED]> wrote: >>> > Here's a nice preliminary number with the FST, 50 million dates of the >>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The >>> > FST is 984 bytes, with an incrementing long to point to the presumably >>> > MMap'd value data. This's a bit crazy. >>> > >>> > Perhaps we should try other increments as well? Given that HBase keys >>> > especially are probably close increments of each other, I think the >>> > FST can always be loaded into RAM with pointers out to the actual >>> > values. >>> > >>> >>
-
Re: prefix compressionStack 2011-06-04, 05:57
That can't be true? (smile) How would you search a 'key' in the FST?
St.Ack On Fri, Jun 3, 2011 at 9:01 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote: > Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz > > You can see the line that's adding the string: > > fstBuilder.add(new BytesRef(date), new Long(x)); > > On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >> Jason - are you feeding it that whole string for each date? Input data is >> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? >> Is it possible to compress by that much? Maybe I'm missing something about >> how the FST works. >> >> Matt >> >> >> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <[EMAIL PROTECTED] >>> wrote: >> >>> Also the next thing to measure with the FST is the key lookup speed. >>> I'm not sure what that'd look like, or how to compare with HBase right >>> now? >>> >>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen >>> <[EMAIL PROTECTED]> wrote: >>> > Here's a nice preliminary number with the FST, 50 million dates of the >>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The >>> > FST is 984 bytes, with an incrementing long to point to the presumably >>> > MMap'd value data. This's a bit crazy. >>> > >>> > Perhaps we should try other increments as well? Given that HBase keys >>> > especially are probably close increments of each other, I think the >>> > FST can always be loaded into RAM with pointers out to the actual >>> > values. >>> > >>> >> >
-
Re: prefix compressionJason Rutherglen 2011-06-04, 07:18
I varied the ms increment randomly between 1-20, then created 10 mil
dates. The FST was then 58,481,582 bytes, eg, 57 MB. Guess it's not perfect! 19,739,994 bytes, eg, 18.8 MB for random 1-5 increments. I think that's still pretty good. I need to try varying the long value stored alongside to see how much that affects the total size. How many keys are there for a typical region, and total per region server? One could try MMap'ing the FST however because access to it is completely random, there's a high risk of page faults. If a key set is sequentially incremented by 1, then that's clearly optimal. I'd guess that's a fairly common case though? > How would you search a 'key' in the FST? Searching on a key is via BytesRefFSTEnum.seekCeil method, which'll match anything greater than or equal to the given parameter. Clearly more benchmarking is required, however this would seem to be a highly promising avenue of research for HBase. On Fri, Jun 3, 2011 at 10:57 PM, Stack <[EMAIL PROTECTED]> wrote: > That can't be true? (smile) How would you search a 'key' in the FST? > St.Ack > > On Fri, Jun 3, 2011 at 9:01 PM, Jason Rutherglen > <[EMAIL PROTECTED]> wrote: >> Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz >> >> You can see the line that's adding the string: >> >> fstBuilder.add(new BytesRef(date), new Long(x)); >> >> On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >>> Jason - are you feeding it that whole string for each date? Input data is >>> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? >>> Is it possible to compress by that much? Maybe I'm missing something about >>> how the FST works. >>> >>> Matt >>> >>> >>> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <[EMAIL PROTECTED] >>>> wrote: >>> >>>> Also the next thing to measure with the FST is the key lookup speed. >>>> I'm not sure what that'd look like, or how to compare with HBase right >>>> now? >>>> >>>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen >>>> <[EMAIL PROTECTED]> wrote: >>>> > Here's a nice preliminary number with the FST, 50 million dates of the >>>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The >>>> > FST is 984 bytes, with an incrementing long to point to the presumably >>>> > MMap'd value data. This's a bit crazy. >>>> > >>>> > Perhaps we should try other increments as well? Given that HBase keys >>>> > especially are probably close increments of each other, I think the >>>> > FST can always be loaded into RAM with pointers out to the actual >>>> > values. >>>> > >>>> >>> >> >
-
Re: prefix compressionJason Rutherglen 2011-06-04, 20:10
Here's some more data for the 10 mil dates:
68.1 MB random increment up to 1000 87.1 MB random increment up to 10,000 162.1 MB total not using the FST On Fri, Jun 3, 2011 at 10:57 PM, Stack <[EMAIL PROTECTED]> wrote: > That can't be true? (smile) How would you search a 'key' in the FST? > St.Ack > > On Fri, Jun 3, 2011 at 9:01 PM, Jason Rutherglen > <[EMAIL PROTECTED]> wrote: >> Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz >> >> You can see the line that's adding the string: >> >> fstBuilder.add(new BytesRef(date), new Long(x)); >> >> On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >>> Jason - are you feeding it that whole string for each date? Input data is >>> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes? >>> Is it possible to compress by that much? Maybe I'm missing something about >>> how the FST works. >>> >>> Matt >>> >>> >>> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <[EMAIL PROTECTED] >>>> wrote: >>> >>>> Also the next thing to measure with the FST is the key lookup speed. >>>> I'm not sure what that'd look like, or how to compare with HBase right >>>> now? >>>> >>>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen >>>> <[EMAIL PROTECTED]> wrote: >>>> > Here's a nice preliminary number with the FST, 50 million dates of the >>>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond. The >>>> > FST is 984 bytes, with an incrementing long to point to the presumably >>>> > MMap'd value data. This's a bit crazy. >>>> > >>>> > Perhaps we should try other increments as well? Given that HBase keys >>>> > especially are probably close increments of each other, I think the >>>> > FST can always be loaded into RAM with pointers out to the actual >>>> > values. >>>> > >>>> >>> >> >
-
Re: prefix compressionStack 2011-06-04, 20:58
On Fri, Jun 3, 2011 at 7:03 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
>> Pluggable formats would help here so you could tune for mem vs cpu. More history. At the time of KV and hfile incubation, we thought about making these building blocks pluggable but it was thought that there would be a performance cost doing the plug-in lookup so we passed. Just FYI. St.Ack |