I was hoping for some feedback on a schema design choice we made.
We are currently using column families to separate out some data in a table
(based on what we've read here and elsewhere). I'll try to outline the basic
layout:
- metadata column family: multiple metadata columns totaling ~3-5k
- data column family 1: single column, 100-200k
- data column family 2: same as data column family 1
- ...
- data column family 1500: same as data column family 1
General access pattern:
write: metadata cf + one data cf (chosen effectively at random).
read: metadata cf + one data cf (chosen effectively at random).
The further we go towards cf1500, the sparser the data is: e.g. every row
has data for cf1, most have data for cf2, but only about 1 in a million
rows has data for cf1500.
We chose to use column families because we never/rarely change or retrieve
two "data" column families at the same time. We store this information in a
single row so that changes to the dataset are atomic.
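To make that concrete, the table definition looks roughly like the following in hbase shell (table name made up, and I've left out the versions/compression settings we actually use):

```
create 'mytable',
  {NAME => 'meta'},
  *(1..1500).map { |i| {NAME => "data#{i}"} }
```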
Everything is working fine. However, the discussion earlier this week about
column families made me realize that my understanding of columns wasn't
entirely correct. I was under the impression that an entire column family
was read when retrieving any column in that family. It sounds like this is
becoming less true as development moves towards 0.90 and beyond. I also
noticed that the web status gui doesn't do tables with many column families
any justice. This makes me wonder: are people actually using tables with
thousands of column families, or is that very rare? And how do people
accomplish "millions of columns": 10 families with 100,000 columns each,
or 10,000 families with hundreds of columns each?
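To make the question concrete, the two layouts I'm comparing would look something like this in hbase shell (table, family, and qualifier names made up):

```
# many families, one/few qualifiers each (what we do today)
put 'mytable', 'row1', 'data42:v', 'payload'

# few families, many qualifiers each
put 'mytable', 'row1', 'data:q42', 'payload'
```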
Thanks for any feedback,