HBase >> mail # dev >> DISCUSS : HFile V3 proposal for tags in 0.96


Re: DISCUSS : HFile V3 proposal for tags in 0.96
I was reading Owen's presentation at Hadoop Summit on ORC.

Slide #14 describes how codecs are used for generic compression.

I think we can adopt some of their ideas in HFile v3.

Cheers

On Fri, Jul 19, 2013 at 9:48 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> On Fri, Jul 19, 2013 at 4:23 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
>
> > If tags are activated but empty, is it going to be the
> > same thing? Or are we going to have all the tags overhead? Like can we
> > have a byte to say "no tags in that file" in addition to "tags are
> > activated for that file"?
> >
>
> This reminds me of an interesting discussion we had. As with memstoreTS,
> if we determine that no cells in a file have tags (or timestamps), then we
> can flag that in the file metadata and turn off any related persistence
> when writing out the data blocks. With millions of KVs in a file, that can
> achieve substantial space savings.
>
> Having a new file format on the table also opens up possibilities like
> block headers: an N-byte structure (where N is something like 4 or 8 bytes,
> maybe) at the start of each block that describes the encoding strategy
> taken for the block: whether tags are present or not; whether we used
> FAST_DIFF, or some new packing together of related values (we put the keys
> up front with one- or two-byte pointers into the block where their values
> are, and de-dup values in the latter part of the block), or a dictionary
> scheme (and with which dictionary in what meta block); etc. We might borrow
> ideas from Parquet or ORC. We can stop serializing HFile blocks as
> individual cells into streams and look at them as a group of cells to write
> into a bytebuffer, providing a lot more freedom for efficiently structuring
> the internal details of the block.
>
> Let me make sure this point makes it out into the public discussion, to
> highlight the additional benefit of having an experimental file format
> available in the 0.96 cycle: it's a place where we and users can go off in
> new directions far beyond inline tags. Of course, such changes in
> unreleased trunk code could make that possible too, but what I have
> observed is that "professional" HBase devs are much more likely than users
> to look at trunk. Users really want to work on and contribute a patch for
> what they are running in production. Consider recent contributions from
> Yahoo and Taobao as an example of what I mean.
>
> The bar for putting something into V2 is extremely high, as it should be on
> account of how performance-critical that code is. I'm not suggesting less
> rigor for V3; what I am suggesting is that V3 can provide design freedom by
> going in different directions than the legacy V2 code.
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
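
To make the block-header idea in the quoted message a bit more concrete, here is a rough sketch in Python (chosen for brevity, not because HBase would do it this way). It is not the HFile V3 format and not HBase code. Everything in it is an assumption made up for illustration: the 4-byte header layout, the BlockFlags names, the 2-byte tags-length field, and the encode_block/decode_block helpers. It only shows the mechanism: a small header at the start of the block records whether tags are present, and when no cell in the block has tags the writer skips the per-cell tags-length field entirely.

```python
import struct
from enum import IntFlag


class BlockFlags(IntFlag):
    """Hypothetical per-block encoding flags (names are illustrative only)."""
    TAGS_PRESENT = 0x01
    FAST_DIFF = 0x02
    KEYS_UP_FRONT = 0x04
    DICTIONARY = 0x08


# Assumed 4-byte header: 1 byte of flags, 1 reserved byte,
# 2 bytes naming the meta block that holds a dictionary (unused here).
HEADER = struct.Struct(">BBH")
# Assumed per-cell prefix: 2-byte key length, 4-byte value length.
CELL_PREFIX = struct.Struct(">HI")
# Assumed 2-byte tags length, written only when the block has tags.
TAGS_LEN = struct.Struct(">H")


def encode_block(cells):
    """Encode a list of (key, value, tags) byte triples into one block."""
    has_tags = any(tags for _, _, tags in cells)
    flags = BlockFlags.TAGS_PRESENT if has_tags else BlockFlags(0)
    out = bytearray(HEADER.pack(int(flags), 0, 0))
    for key, value, tags in cells:
        out += CELL_PREFIX.pack(len(key), len(value))
        if has_tags:
            out += TAGS_LEN.pack(len(tags))
        out += key + value
        if has_tags:
            out += tags
    return bytes(out)


def decode_block(buf):
    """Decode a block written by encode_block; returns (flags, cells)."""
    flags_raw, _reserved, _dict_block = HEADER.unpack_from(buf, 0)
    flags = BlockFlags(flags_raw)
    has_tags = bool(flags & BlockFlags.TAGS_PRESENT)
    pos = HEADER.size
    cells = []
    while pos < len(buf):
        key_len, value_len = CELL_PREFIX.unpack_from(buf, pos)
        pos += CELL_PREFIX.size
        tags_len = 0
        if has_tags:
            (tags_len,) = TAGS_LEN.unpack_from(buf, pos)
            pos += TAGS_LEN.size
        key = buf[pos:pos + key_len]
        pos += key_len
        value = buf[pos:pos + value_len]
        pos += value_len
        tags = buf[pos:pos + tags_len]
        pos += tags_len
        cells.append((key, value, tags))
    return flags, cells
```

In this sketch an untagged block saves the 2-byte tags length on every cell, which is the kind of saving that adds up over millions of KVs. The other flags are placeholders for the encodings mentioned in the message (FAST_DIFF, keys up front, dictionary) and are not implemented here.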