HBase, mail # user - Disk Seeks and Column families


Re: Disk Seeks and Column families
Andrey Stepachev 2012-01-21, 18:58
On 21 January 2012 at 19:16, Doug Meil
<[EMAIL PROTECTED]> wrote:
>
> One other "big picture" comment:  Hbase scales by having lots of servers,
> and servers with multiple drives. While single-read performance is
> obviously important, there is more to Hbase than a single-server RDBMS
> drag-race comparison.  It's a distributed architecture (as with MapReduce).
>
> re:  "hbase is not so good in case of wide tables, hbase prefers tall
> tables"
>
> Per... http://hbase.apache.org/book.html#schema.smackdown  this is
> absolutely true in the extreme cases as described in the book, but I
> wouldn't consider hundreds or thousands of attributes to be in that
> category as the definition of "wide" tends to be subjective.

This statement mostly applies to schemas where the column name is
a subkey, for example a time series for some attribute. Such a layout
does not scale well and is not handled well by HBase, because splits
are performed only on row boundaries, so a single row can never be
divided between regions.
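
To make the time-series case concrete, here is a minimal sketch in plain Python (not the HBase API). The metric name `cpu.load`, the key format and the helper names are hypothetical, chosen only for illustration. It compares a wide layout, where the timestamp is the column qualifier, with a tall layout, where the timestamp is part of the row key, and reports the size of the largest row, which is the smallest unit a split can isolate.

```python
from collections import Counter

def wide_cells(metric, samples):
    """Wide layout: one row per metric, timestamp stored as the column qualifier."""
    return [(metric, str(ts), value) for ts, value in samples]

def tall_cells(metric, samples):
    """Tall layout: timestamp folded into the row key, one fixed qualifier 'v'."""
    return [(f"{metric}/{ts:013d}", "v", value) for ts, value in samples]

def largest_row(cells):
    """Cell count of the biggest row; a split cannot cut below this unit."""
    return max(Counter(row for row, _, _ in cells).values())

# 10,000 hypothetical samples, one per second.
samples = [(1309812287000 + i * 1000, i) for i in range(10_000)]
wide = wide_cells("cpu.load", samples)
tall = tall_cells("cpu.load", samples)

print(largest_row(wide))  # 10000: the whole series sits in one unsplittable row
print(largest_row(tall))  # 1: a split point can fall between any two samples
```

With the tall layout the zero-padded timestamp keeps row keys in time order, so a range scan over one metric still reads neighbouring keys.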

>
>
>
>
> On 1/21/12 8:52 AM, "Doug Meil" <[EMAIL PROTECTED]> wrote:
>
>>
>>Also, for #2 Hbase supports large-scale aggregation through MapReduce.
>>
>>
>>
>>
>>On 1/21/12 7:47 AM, "Andrey Stepachev" <[EMAIL PROTECTED]> wrote:
>>
>>>2012/1/21 Praveen Sripati <[EMAIL PROTECTED]>:
>>>> Hi,
>>>>
>>>> 1) According to this URL (1), HBase performs well with two or three
>>>> column families. Why is that?
>>>
>>>First, each column family is stored in a separate location, so, as stated in
>>>'6.2.1. Cardinality of ColumnFamilies', such a schema design can leave a
>>>small column family spread over many small pieces, and aggregation over it
>>>will be slow.
>>>Second, when a region splits, all of its column families are split too,
>>>which can be inefficient when there are many of them.
>>>Third, there is the number of memstores. Each column family
>>>has its own memstore, so forced flushes and blocked writes
>>>become more likely.
>>>
>>>>
>>>> 2) A dump of an HFile looks like the one below. The contents of a row stay
>>>> together, as in a regular row-oriented database. If the column family has
>>>> 100 column qualifiers and is dense, then the data for a particular column
>>>> qualifier is spread wide. If I want to do an aggregation on a particular
>>>> column qualifier, the disk seeks don't seem to be much better than in a
>>>> regular row-oriented database.
>>>
>>>You don't need a seek for each column: HBase reads whole blocks and filters
>>>out the unneeded data.
>>>Most of the performance comes from collocated keys and compression.
>>>BTW, hbase is not so good in case of wide tables, hbase prefers tall
>>>tables.
>>>
>>>>
>>>> Please correct me if I am wrong.
>>>>
>>>> K: row-550/colfam1:50/1309813948188/Put/vlen=2 V: 50
>>>> K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
>>>> K: row-551/colfam1:51/1309813948222/Put/vlen=2 V: 51
>>>> K: row-551/colfam1:51/1309812287200/Put/vlen=2 V: 51
>>>> K: row-552/colfam1:52/1309813948256/Put/vlen=2 V: 52
>>>>
>>>> (1) - http://hbase.apache.org/book/number.of.cfs.html
>>>>
>>>> Thanks,
>>>> Praveen
>>>
>>>
>>>
>>>--
>>>Andrey.
>>>
>>
>>
>>
>
>
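
On the seek question (2) quoted above, the following is a rough sketch of the "read blocks, filter in memory" behaviour, written in plain Python rather than against the HBase API. It is a simplified model under stated assumptions: `cells_per_block` is a hypothetical knob (real HFile blocks are sized in bytes, not cells), and the row and qualifier names are made up. The point is only that the cost of an aggregation over one qualifier is counted in sequential block reads, not in one seek per matching cell.

```python
def build_blocks(cells, cells_per_block):
    """Pack cells, already sorted by (row, qualifier), into fixed-size blocks."""
    return [cells[i:i + cells_per_block]
            for i in range(0, len(cells), cells_per_block)]

def sum_qualifier(blocks, qualifier):
    """Read every block in order and keep only cells of the wanted qualifier."""
    blocks_read = 0
    total = 0
    for block in blocks:            # sequential pass, no per-cell seek
        blocks_read += 1
        for _, q, value in block:
            if q == qualifier:      # other columns are discarded in memory
                total += value
    return total, blocks_read

# 1,000 dense rows with 100 qualifiers each; each value is its qualifier index.
cells = [(f"row-{r:04d}", f"{q:03d}", q)
         for r in range(1000) for q in range(100)]
blocks = build_blocks(cells, cells_per_block=500)

total, blocks_read = sum_qualifier(blocks, "050")
print(total, blocks_read)  # 50000 200
```

In this model the scan touches all 200 blocks to pick out 1,000 matching cells, which is why keeping the data tall, collocated and compressed matters more than the number of seeks.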

--
Andrey.