

Re: Disk Seeks and Column families
On 21 January 2012 at 19:16, Doug Meil
<[EMAIL PROTECTED]> wrote:
>
> One other "big picture" comment:  HBase scales by having lots of servers,
> and servers with multiple drives. While single-read performance is
> obviously important, there is more to HBase than a single-server RDBMS
> drag-race comparison.  It's a distributed architecture (as with MapReduce).
>
> re:  "hbase is not so good in case of wide tables, hbase prefers tall
> tables"
>
> Per... http://hbase.apache.org/book.html#schema.smackdown  this is
> absolutely true in the extreme cases as described in the book, but I
> wouldn't consider hundreds or thousands of attributes to be in that
> category as the definition of "wide" tends to be subjective.

This statement mostly relates to schemas where the column name is
a subkey, for example a time series for some attribute. Such a design
does not scale well and is not handled well by HBase, because splits
are performed only on row boundaries, so a single row is never divided.
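
To illustrate the point about row boundaries, here is a minimal sketch in
plain Python (no HBase client involved). The function names, the "cpu"
metric, and the "metric#timestamp" row key format are all made up for
illustration; it only models how many cells end up in one row under each
layout, assuming a row is the smallest unit a split can separate.

```python
from collections import defaultdict


def wide_rows(samples):
    """Wide layout: one row per metric, each timestamp is a column qualifier."""
    rows = defaultdict(dict)
    for metric, ts, value in samples:
        rows[metric][str(ts)] = value
    return dict(rows)


def tall_rows(samples):
    """Tall layout: one row per (metric, timestamp), one fixed qualifier 'v'."""
    rows = {}
    for metric, ts, value in samples:
        # Hypothetical key format; zero padding keeps lexicographic order
        # equal to time order.
        rows[f"{metric}#{ts:013d}"] = {"v": value}
    return rows


def largest_row(rows):
    """Number of cells in the biggest row, i.e. the smallest piece that a
    row-boundary split can isolate in this simplified model."""
    return max(len(cells) for cells in rows.values())


# 1000 samples of a single hypothetical metric.
samples = [("cpu", 1309812287000 + i, i) for i in range(1000)]
wide = wide_rows(samples)
tall = tall_rows(samples)

print("wide:", len(wide), "row(s), largest row has", largest_row(wide), "cells")
print("tall:", len(tall), "row(s), largest row has", largest_row(tall), "cells")
```

In the wide layout the whole series sits in one row, so it keeps growing
inside a single region. In the tall layout the same data is spread over many
small rows, and a split can fall between any two of them.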

>
>
>
>
> On 1/21/12 8:52 AM, "Doug Meil" <[EMAIL PROTECTED]> wrote:
>
>>
>>Also, for #2 HBase supports large-scale aggregation through MapReduce.
>>
>>
>>
>>
>>On 1/21/12 7:47 AM, "Andrey Stepachev" <[EMAIL PROTECTED]> wrote:
>>
>>>2012/1/21 Praveen Sripati <[EMAIL PROTECTED]>:
>>>> Hi,
>>>>
>>>> 1) According to this URL (1), HBase performs well with two or three
>>>> column families. Why is that so?
>>>
>>>First, each column family is stored in a separate location, so, as stated in
>>>'6.2.1. Cardinality of ColumnFamilies', such a schema design can leave a
>>>small column family scattered in many small pieces, and aggregation over it
>>>will be slow.
>>>Second, when a region splits, all of its column families are split too,
>>>which can be inefficient when there are many of them.
>>>Third, there is the number of memstores. Each column family
>>>has its own memstore, so it is more likely to hit forced flushes
>>>and blocked writes.
>>>
>>>>
>>>> 2) A dump of an HFile looks like the one below. The contents of a row stay
>>>> together, as in a regular row-oriented database. If the column family has
>>>> 100 column qualifiers and is dense, then the data for a particular column
>>>> qualifier is spread wide. If I want to do an aggregation on a particular
>>>> column qualifier, the disk seeks don't seem to be much better than in a
>>>> regular row-oriented database.
>>>
>>>You don't need a seek for each column; HBase reads whole blocks and filters
>>>out the unneeded data.
>>>Most of the performance comes from collocated keys and compression.
>>>BTW, hbase is not so good in case of wide tables, hbase prefers tall
>>>tables.
>>>
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>>>> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>>>> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>>>> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>>>> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>>>>
>>>> (1) - http://hbase.apache.org/book/number.of.cfs.html
>>>>
>>>> Thanks,
>>>> Praveen
>>>
>>>
>>>
>>>--
>>>Andrey.
>>>
>>
>>
>>
>
>

--
Andrey.