I'm trying to work up a reference card to remember this stuff. Can someone
confirm or deny the following statements?
Each HBase block can hold at most one row and one column family. A row may
span multiple HBase blocks, but an HBase block may contain only one row.
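
For the reference card, here is a rough toy model of what William describes
in the quoted reply below: cells packed into fixed-size blocks, with a lookup
that first picks a block and then walks the keyvalue pairs inside it. This is
only a sketch in Python, not HBase's actual HFile code. The names pack_blocks
and lookup, and the size accounting (key length plus value length), are
assumptions made up for illustration.

```python
from bisect import bisect_right


def pack_blocks(cells, block_size):
    """Greedily pack sorted (key, value) cells into blocks of ~block_size bytes.

    Toy assumption: a cell's size is len(key) + len(value). A cell larger
    than block_size still gets a block of its own.
    """
    blocks, current, used = [], [], 0
    for key, value in sorted(cells):
        size = len(key) + len(value)
        # Start a new block when this cell would overflow the current one.
        if current and used + size > block_size:
            blocks.append(current)
            current, used = [], 0
        current.append((key, value))
        used += size
    if current:
        blocks.append(current)
    return blocks


def lookup(blocks, key):
    """Return (value, cells_scanned) for key, or (None, cells_scanned)."""
    # Step 1: choose the block whose first key is <= the wanted key.
    first_keys = [block[0][0] for block in blocks]
    idx = bisect_right(first_keys, key) - 1
    if idx < 0:
        return None, 0
    # Step 2: walk the keyvalue pairs inside that block one by one.
    scanned = 0
    for k, v in blocks[idx]:
        scanned += 1
        if k == key:
            return v, scanned
    return None, scanned


if __name__ == "__main__":
    small = [(f"k{i:03d}", "x" * 10) for i in range(100)]
    big = [(f"k{i:03d}", "y" * 200) for i in range(5)]
    print(lookup(pack_blocks(small, 140), "k009")[1])
    print(lookup(pack_blocks(big, 140), "k003")[1])
```

In this toy setup, many small cells in a comparatively large block mean a
lookup may walk several cells before it finds the one it wants, while cells
as large as the block leave one cell per block and nothing to walk.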
On Fri, Oct 15, 2010 at 8:54 PM, William Kang <[EMAIL PROTECTED]> wrote:
> Hi Jacques,
> If I understand correctly, it depends on several factors. First is the
> configured block size; second is the typical cell size. A block may
> have multiple keyvalue pairs. If the block size is bigger than the
> cell size, a block may have multiple cells, which are stored in the
> block as keyvalue pairs. To locate a keyvalue pair, you have to
> traverse the block if it holds multiple keyvalue pairs.
> With that being said, if you have a column family with lots of very
> small cell values and large block size, it is going to be slow to
> traverse inside the block to locate the wanted cell. But, if you have
> a column family with few big cells inside it and the block size is
> only big enough to host one cell, there is no need to traverse in the
> block.
> Hope it helps a little.
> On Fri, Oct 15, 2010 at 8:27 PM, Jacques <[EMAIL PROTECTED]> wrote:
> > I was hoping for some feedback on a schema design choice we made.
> > We are currently using column families to separate out some data in a
> > (based on what we've read here and elsewhere). I'll try to outline the
> > schema below.
> > *Pseudo schema*
> > metadata column family: multiple metadata columns totaling ~3-5k
> > data column family 1: single column, 100-200k
> > data column family 2: same as data column family 1
> > ...
> > data column family 1500: same as data column family 1
> > General access pattern:
> > write: main cf + one random data cf.
> > read: main cf + one random data cf.
> > The further we go towards the 1500, the more sparse the data is. E.g.
> > every row has data for cf1, most have data for cf2, and only 1 in a
> > million might have data for cf1500.
> > We chose to use column families because we never/rarely change or read
> > two "data" column families at the same time. We store this information
> > in a single row so that we have atomic changes to the dataset.
> > Everything is working fine. However, the discussion earlier this week
> > about column families made me realize that my understanding of columns
> > wasn't entirely correct. I was under the impression that an entire column
> > family was read when retrieving any column in that family. It sounds like
> > this is becoming less true as development moves towards 0.90 and beyond.
> > I also
> > noticed that the web status GUI doesn't do tables with many column
> > families any justice. This makes me wonder if people are using tables with
> > lots of column families or if it is very rare? How do people accomplish
> > "millions of columns"? 10 families with 100,000 columns each or 10,000
> > families with hundreds of columns each?
> > Thanks for any feedback,
> > Jacques