HBase, mail # user - More on Column Family versus Column


Jacques 2010-10-16, 00:27
William Kang 2010-10-16, 03:54
Re: More on Column Family versus Column
Jacques 2010-10-18, 16:17
I'm trying to work up a reference card to remember this stuff.  Can someone
confirm or deny the following statements?

Each HBase block can hold at most one row and one column family.  A row may
span multiple HBase blocks, but an HBase block may contain only one row.

Thanks,
Jacques

On Fri, Oct 15, 2010 at 8:54 PM, William Kang <[EMAIL PROTECTED]> wrote:

> Hi Jacques,
> If I understand correctly, it depends on several factors. First is the
> configured block size; second is the typical cell size. A block may
> have multiple keyvalue pairs. If the block size is bigger than the
> cell size, a block may have multiple cells, which are stored in the
> block as keyvalue pairs. To locate a keyvalue pair, you have to
> traverse the block if it holds multiple keyvalue pairs.
> With that being said, if you have a column family with lots of very
> small cell values and a large block size, it is going to be slow to
> traverse inside the block to locate the wanted cell. But, if you have
> a column family with few big cells inside it and the block size is
> only big enough to host one cell, there is no need to traverse in the
> block.
> Hope it helps a little.
>
>
> William
>
> On Fri, Oct 15, 2010 at 8:27 PM, Jacques <[EMAIL PROTECTED]> wrote:
> > I was hoping for some feedback on a schema design choice we made.
> >
> > We are currently using column families to separate out some data in a
> > table (based on what we've read here and elsewhere).  I try to outline
> > the basics below.
> >
> > *Pseudo schema*
> > metadata column family: multiple metadata columns, ~3-5k total
> > data column family 1: single column, 100-200k
> > data column family 2: same as data column family 1
> > ...
> > data column family 1500: same as data column family 1
> >
> > General access pattern:
> > write: main cf + one random data cf.
> > read: main cf + one random data cf.
> >
> > The further we go towards the 1500, the more sparse the data is.  E.g.
> > every row has data for cf1, most have for cf2, only 1 in a million might
> > have it for cf1500.
> > We chose to use column families because we never/rarely change or
> > retrieve two "data" column families at the same time.  We store this
> > information in a single row so that we have atomic changes to the dataset.
> >
> > Everything is working fine.  However, the discussion earlier this week
> > about column families made me realize that my understanding of columns
> > wasn't entirely correct.  I was under the impression that an entire
> > column family was read when retrieving any column in that family.  It
> > sounds like this is becoming less true as development moves towards .90
> > and beyond.  I also noticed that the web status gui doesn't do tables
> > with many column families any justice.  This makes me wonder if people
> > are using tables with thousands of column families or if it is very
> > rare?  How do people accomplish "millions of columns"?  10 families with
> > 100,000 columns each or 10,000 families with hundreds of columns each?
> >
> > Thanks for any feedback,
> >
> > Jacques
> >
>
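
To make William's point about block size versus cell size concrete, here is a
rough sketch in Python. It is a toy model, not HBase code: the helper names
pack_blocks and find_cell, the 64 KB block size, and the sample cell sizes are
all assumptions chosen for illustration. It packs sorted key/value cells into
blocks of roughly the configured size, then counts how many cells must be
scanned inside one block to find a key.

```python
from bisect import bisect_right

BLOCK_SIZE = 64 * 1024  # assumed block size for this sketch


def pack_blocks(cells, block_size):
    """Group sorted (key, value) cells into blocks of roughly block_size bytes."""
    blocks, current, used = [], [], 0
    for key, value in cells:
        current.append((key, value))
        used += len(key) + len(value)
        if used >= block_size:  # close the block once the target size is reached
            blocks.append(current)
            current, used = [], 0
    if current:
        blocks.append(current)
    return blocks


def find_cell(blocks, key):
    """Return (value, cells_scanned); value is None when the key is absent."""
    # Pick the candidate block from the first key of every block,
    # then walk that block cell by cell.
    first_keys = [block[0][0] for block in blocks]
    i = bisect_right(first_keys, key) - 1
    if i < 0:
        return None, 0
    scanned = 0
    for k, v in blocks[i]:
        scanned += 1
        if k == key:
            return v, scanned
        if k > key:
            break
    return None, scanned


# Many tiny cells: they all fit into one block, so a lookup scans many cells.
small = [(f"row{i:04d}".encode(), b"x" * 10) for i in range(1000)]
# A few big cells: each one fills a block, so a lookup scans a single cell.
large = [(f"row{i:04d}".encode(), b"x" * 70000) for i in range(10)]

small_blocks = pack_blocks(small, BLOCK_SIZE)
large_blocks = pack_blocks(large, BLOCK_SIZE)

print(len(small_blocks), find_cell(small_blocks, b"row0999")[1])
print(len(large_blocks), find_cell(large_blocks, b"row0005")[1])
```

In this model the block boundary depends only on accumulated size, so whether
a lookup needs a long scan inside the block comes down to how many cells fit
under the configured block size.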
William Kang 2010-10-18, 19:43