HBase >> mail # user >> More on Column Family versus Column

Re: More on Column Family versus Column
I'm trying to work up a reference card to remember this stuff.  Can someone
confirm or deny the following statements?

Each HBase block can hold at most one row and one column family.  A row may
span multiple HBase blocks, but an HBase block may contain only one row.

Thanks,
Jacques

On Fri, Oct 15, 2010 at 8:54 PM, William Kang <[EMAIL PROTECTED]> wrote:

> Hi Jacques,
> If I understand correctly, it depends on a couple of factors: the
> configured block size and the typical cell size. Cells are stored in
> a block as KeyValue pairs, so if the block size is larger than the
> cell size, a block may hold multiple cells. To locate a KeyValue pair
> in such a block, you have to scan through the block.
> That said, if you have a column family with lots of very small cell
> values and a large block size, scanning inside the block to locate
> the wanted cell is going to be slow. But if you have a column family
> with a few big cells and the block size is only big enough to hold
> one cell, there is no need to scan within the block.
> Hope this helps a little.
>
>
> William
>
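
William's point about block size versus cell size can be illustrated with a toy model. This is only a sketch, not HBase's HFile code: `pack_blocks` and `find_cell` are hypothetical helpers, keys are plain byte strings, and the 64 KB block size and the cell sizes are assumed values picked to resemble the sizes mentioned in this thread.

```python
from bisect import bisect_right


def pack_blocks(cells, block_size):
    """Greedily pack sorted (key, value) cells into blocks.

    A block is closed once its payload reaches block_size bytes, so a cell
    larger than block_size ends up alone in its own block.
    """
    blocks, current, used = [], [], 0
    for key, value in sorted(cells):
        current.append((key, value))
        used += len(key) + len(value)
        if used >= block_size:
            blocks.append(current)
            current, used = [], 0
    if current:
        blocks.append(current)
    return blocks


def find_cell(blocks, key):
    """Return (value, cells_scanned) for key.

    The first key of each block acts as a block index: binary-search it to
    pick the block, then scan linearly inside that block.
    """
    first_keys = [block[0][0] for block in blocks]
    i = bisect_right(first_keys, key) - 1
    if i < 0:
        return None, 0
    for scanned, (k, v) in enumerate(blocks[i], start=1):
        if k == key:
            return v, scanned
    return None, len(blocks[i])


BLOCK_SIZE = 64 * 1024  # assumed block size for this illustration

# Many tiny cells: all 1000 fit in a single block.
small_blocks = pack_blocks(
    [(b"row%04d" % i, b"x" * 10) for i in range(1000)], BLOCK_SIZE
)
# A few big cells (~150 KB each): one cell per block.
big_blocks = pack_blocks(
    [(b"row%04d" % i, b"x" * 150_000) for i in range(10)], BLOCK_SIZE
)

print(len(small_blocks), find_cell(small_blocks, b"row0999")[1])  # 1 1000
print(len(big_blocks), find_cell(big_blocks, b"row0007")[1])      # 10 1
```

In this model the small-cell lookup scans up to 1000 cells inside one block, while the big-cell lookup lands on its cell immediately after the index search.
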
> On Fri, Oct 15, 2010 at 8:27 PM, Jacques <[EMAIL PROTECTED]> wrote:
> > I was hoping for some feedback on a schema design choice we made.
> >
> > We are currently using column families to separate out some data in a table
> > (based on what we've read here and elsewhere).  I've tried to outline the
> > basics below.
> >
> > *Pseudo schema*
> > metadata column family: multiple metadata columns, ~3-5k total
> > data column family 1: single column, 100-200k
> > data column family 2: same as data column family 1
> > ...
> > data column family 1500: same as data column family 1
> >
> > General access pattern:
> > write: main cf + one random data cf.
> > read: main cf + one random data cf.
> >
> > The further we go towards 1500, the sparser the data is.  E.g. every
> > row has data for cf1, most have it for cf2, and only 1 in a million might
> > have it for cf1500.
> > We chose to use column families because we never/rarely change or retrieve
> > two "data" column families at the same time.  We store this information in a
> > single row so that we have atomic changes to the dataset.
> >
> > Everything is working fine.  However, the discussion earlier this week about
> > column families made me realize that my understanding of columns wasn't
> > entirely correct.  I was under the impression that an entire column family
> > was read when retrieving any column in that family.  It sounds like this is
> > becoming less true as development moves towards 0.90 and beyond.  I also
> > noticed that the web status GUI doesn't do tables with many column families
> > any justice.  This makes me wonder whether people are using tables with
> > thousands of column families or whether that is very rare.  How do people
> > accomplish "millions of columns"?  10 families with 100,000 columns each, or
> > 10,000 families with hundreds of columns each?
> >
> > Thanks for any feedback,
> >
> > Jacques
> >
>
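
The access pattern in Jacques's original message (write or read the metadata family plus one data family, all within one row) can be sketched with a toy table that keeps one dictionary per column family. This is an assumption-laden illustration only: `ToyTable`, the family names `meta` and `dataN`, and the qualifiers are hypothetical, and the all-or-nothing check in `put` merely stands in for row-level atomicity without modelling how HBase implements it.

```python
class ToyTable:
    """Toy model: one dict per column family, keyed by row."""

    def __init__(self, families):
        self.stores = {family: {} for family in families}
        self.touched = []  # families consulted by the most recent get()

    def put(self, row, updates):
        """Apply {family: {qualifier: value}} to one row, all or nothing."""
        unknown = set(updates) - set(self.stores)
        if unknown:
            # Reject the whole update before writing anything.
            raise KeyError(f"unknown column families: {sorted(unknown)}")
        for family, columns in updates.items():
            self.stores[family].setdefault(row, {}).update(columns)

    def get(self, row, families):
        """Read only the requested families for a row."""
        self.touched = list(families)
        return {f: dict(self.stores[f].get(row, {})) for f in families}


# Hypothetical layout: one metadata family plus 1500 data families.
families = ["meta"] + [f"data{i}" for i in range(1, 1501)]
table = ToyTable(families)

# Write: metadata family + one data family, in a single row update.
table.put("row1", {"meta": {"m1": b"meta"}, "data7": {"payload": b"x" * 150_000}})

# Read: metadata family + one data family; the other 1499 are never consulted.
result = table.get("row1", ["meta", "data7"])
print(table.touched)  # ['meta', 'data7']
```
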