

Re: DISCUSS : HFile V3 proposal for tags in 0.96
On Fri, Jul 19, 2013 at 4:23 AM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:

> If tags are activated but empty, is it going to be the
> same thing? Or are we going to have all the tags overhead? Like can we have
> a byte to say "no tags in that file" in addition to "tags are activated for
> that file"?
>

This reminds me of an interesting discussion we had. As with memstoreTS, if
we determine that no cells in a file have tags (or timestamps), we can flag
that in the file metadata and turn off any related persistence when writing
out the data blocks. With millions of KVs in a file, that can achieve
substantial space savings.

Having a new file format on the table also opens up possibilities like block
headers: an N-byte structure (where N is something like 4 or 8 bytes) at the
start of each block that describes the encoding strategy taken for the
block: whether tags are present, whether we used FAST_DIFF, some new packing
of related values (keys up front with one- or two-byte pointers into the
block where their values are, and de-duplicated values in the latter part of
the block), or a dictionary scheme (and which dictionary in which meta
block), etc. We might borrow ideas from Parquet or ORC. We can stop
serializing HFile blocks as individual cells into streams and look at them
as a group of cells to write into a ByteBuffer, which provides a lot more
freedom for efficiently structuring the internal details of the block.

Let me make sure this point makes it out into the public discussion, to
highlight the additional benefit of having an experimental file format
available in the 0.96 cycle: it's a place where we and users can go off in
new directions far beyond inline tags. Of course such changes in unreleased
trunk code could make that possible too, but what I have observed is that
"professional" HBase devs are much more likely to look at trunk than a user
is. Users really want to work on and contribute patches for what they are
running in production. Consider recent contributions from Yahoo and Taobao
as examples of what I mean. The bar for putting something into V2 is
extremely high, as it should be, on account of how performance critical that
code is. I'm not suggesting less rigor for V3; what I am suggesting is that
V3 can provide design freedom by going in different directions than the
legacy V2 code.
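
To make the block header idea a bit more concrete, here is a rough sketch in
Java of what a 4-byte header might look like. Everything in it is an
assumption made up for illustration: the class name, the flag bits, the
encoding ids and the field layout are hypothetical and are not part of any
existing or proposed HFile format.

```java
import java.nio.ByteBuffer;

/** Hypothetical 4-byte block header; NOT an actual HFile layout. */
final class BlockHeaderSketch {
    static final int SIZE = 4;  // bytes: flags(1) + encoding(1) + dictionary meta block(2)

    // Flag bits (assumed assignments, for illustration only).
    static final int FLAG_HAS_TAGS = 1;
    static final int FLAG_HAS_MEMSTORE_TS = 1 << 1;

    // Encoding ids (assumed values, for illustration only).
    static final int ENC_NONE = 0;
    static final int ENC_FAST_DIFF = 1;
    static final int ENC_DICTIONARY = 2;

    final int flags;                // unsigned byte
    final int encoding;             // unsigned byte
    final int dictionaryMetaBlock;  // unsigned short; meaningful only for ENC_DICTIONARY

    BlockHeaderSketch(int flags, int encoding, int dictionaryMetaBlock) {
        if (flags < 0 || flags > 0xFF || encoding < 0 || encoding > 0xFF
                || dictionaryMetaBlock < 0 || dictionaryMetaBlock > 0xFFFF) {
            throw new IllegalArgumentException("header field out of range");
        }
        this.flags = flags;
        this.encoding = encoding;
        this.dictionaryMetaBlock = dictionaryMetaBlock;
    }

    /** Writes the header at the buffer's current position. */
    void writeTo(ByteBuffer buf) {
        buf.put((byte) flags);
        buf.put((byte) encoding);
        buf.putShort((short) dictionaryMetaBlock);
    }

    /** Reads a header from the buffer's current position. */
    static BlockHeaderSketch readFrom(ByteBuffer buf) {
        int flags = buf.get() & 0xFF;
        int encoding = buf.get() & 0xFF;
        int dictionaryMetaBlock = buf.getShort() & 0xFFFF;
        return new BlockHeaderSketch(flags, encoding, dictionaryMetaBlock);
    }

    boolean hasTags() {
        return (flags & FLAG_HAS_TAGS) != 0;
    }

    boolean hasMemstoreTs() {
        return (flags & FLAG_HAS_MEMSTORE_TS) != 0;
    }
}
```

A reader would parse these bytes first and then pick the decoder for the
rest of the block, for example skipping tag parsing entirely when hasTags()
is false. A writer that has buffered the whole block as a group of cells
could choose the flags and encoding per block before emitting the header.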

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)