-Re: Disk Seeks and Column families
Andrey Stepachev 2012-01-24, 06:51
2012/1/24 Praveen Sripati <[EMAIL PROTECTED]>:
> Thanks for the response. I am just getting started with HBase. And before
> getting into the code/api level details, I am trying to understand the
> problem area HBase is trying to address through it's architecture/design.
> 1) So, what are the recommendations for having many columns and with dense
> data? Is HBase not the right tool?
Split them by prefixing keys. (i.e. key->a,b,c => a_key, b_key, c_key).
and aggregate as independent values. (if possible)
> 2) Also, if the data for a column is spread wide across blocks and maybe
> even across nodes how will HBase help in aggregation?
Think and optimize your data layout for your tasks. HBase is not an rdbs.
You should plan you schema in a way, that suites best for your queries.
> 3) Also, about storing data using an incremental row key, initially there
> will be a hot stop with the data getting to a single region. Even after a
> split of the region into two, the first one won't be getting any data (in
> incremental row key) and the second one will be hammered.
a) As in 1), add something to key. For example each 5 minutes. Later your
can issue 16 queries and merge them (for realtime)
b) If this data for mapreduce, you can do key day + (md5(time)) later MR
task collect all data in right place for aggregation.
(as usual you must tradeoff write speed and query speed).
c) split your incoming data by other field, for example host or metric.
You can look at data model of the http://opentsdb.net/
> One of the approach to alleviate this is not to insert incremental row keys
> from the client and have the row keys scattered for better load balancing.
> But, this approach is not efficient if I want to get events in a time
> sequence, in which case I have to use some filters to scan the entire data.
> 4) Still not clear why I can't have 10 column families in HBase and why
> only 2 or 3 of them according to this link (1)?
a) you should tune a bunch of parameters
hbase.hstore.blockingStoreFiles and others
to get it works at high write load. But according to architecture
of memstore and splits less families performs better.
b) you can write small benchmark and see, that 2 family is significally faster
> (1) - http://hbase.apache.org/book/number.of.cfs.html
> On Sun, Jan 22, 2012 at 12:02 PM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
>> basically you are correct on all counts. If there are too many columns,
>> HBase will have to issue more disk-seeks to extract only the particular
>> columns you need ... and since the data is laid out horizontally there are
>> fewer common substrings in a single HBase-block and compression quality
>> starts to degrade due to reduced redundancy.
>> On Sat, Jan 21, 2012 at 9:49 AM, Praveen Sripati
>> <[EMAIL PROTECTED]>wrote:
>> > Thanks for the response.
>> > > The contents of a row stay together like a regular row-oriented
>> > > K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>> > > K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>> > > K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>> > > K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>> > > K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>> > Is the above statement true for a HFile?
>> > Also from the above example, the data for the column family qualifier are
>> > not adjacent to take advantage of compression (
>> > http://en.wikipedia.org/wiki/Column-oriented_DBMS#Compression). Is this
>> > proper statement?
>> > Regards,
>> > Praveen
>> > On Sat, Jan 21, 2012 at 9:03 PM, <[EMAIL PROTECTED]> wrote:
>> > > Have you considered using AggregationProtocol to perform aggregation ?
>> > >
>> > > Thanks
>> > >
>> > >
>> > >
>> > > On Jan 20, 2012, at 11:08 PM, Praveen Sripati <
>> [EMAIL PROTECTED]>
>> > > wrote:
>> > >
>> > > > Hi,
>> > >