HBase user mailing list - Insert into tall table 50% faster than wide table


Re: Insert into tall table 50% faster than wide table
Bryan Keller 2010-12-23, 04:34
So for the tall table, the row key is customer ID + order ID.

For the wide table, the row key is the customer ID alone. The column names (qualifiers) are prefixed with the order ID so they are unique per order, i.e. there are 10 columns prefixed with the first order's ID, 10 columns prefixed with the second order's ID, etc. The order and customer IDs are random UUIDs (16 bytes).
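For concreteness, a minimal sketch of how those two layouts might be built with the pre-1.0 HBase Java client; the family name, the "qty" field, and the uuidToBytes helper are illustrative, not taken from the thread:

    import java.nio.ByteBuffer;
    import java.util.UUID;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyLayouts {
        static final byte[] FAM = Bytes.toBytes("f"); // illustrative family name

        // Hypothetical helper: the 16-byte form of a UUID.
        static byte[] uuidToBytes(UUID u) {
            return ByteBuffer.allocate(16)
                    .putLong(u.getMostSignificantBits())
                    .putLong(u.getLeastSignificantBits())
                    .array();
        }

        // Tall layout: one row per order, row key = customer ID + order ID.
        static Put tallPut(UUID customerId, UUID orderId, byte[] value) {
            Put p = new Put(Bytes.add(uuidToBytes(customerId), uuidToBytes(orderId)));
            p.add(FAM, Bytes.toBytes("qty"), value); // plain qualifier
            return p;
        }

        // Wide layout: one row per customer, qualifier prefixed with the order ID.
        static Put widePut(UUID customerId, UUID orderId, byte[] value) {
            Put p = new Put(uuidToBytes(customerId));
            p.add(FAM, Bytes.add(uuidToBytes(orderId), Bytes.toBytes("qty")), value);
            return p;
        }
    }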

On Dec 22, 2010, at 7:35 PM, Michael Segel wrote:

>
> Ted,
>
> yes, 10K rows one for each customer.
> But if you write each order as a column, and there are 10 'columns' in an order, you have to somehow serialize the 10 columns that represent the order so you get one column per order_id.
> Of course you could still write out a column as order_id,order_column and then get your 6,000 columns. If you did that, then you have the issue of your column id. Did you go column_id,order_id or did you go order_id,column_id?
> (One has to ask... :-)  )
>
> IMHO I'd elect to put the 10 columns of the order in a single column rather than write the 10 columns as individual columns.  But that's just me. :-)
>
> -Mike
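A minimal sketch of the single-column layout Mike describes above, assuming the pre-1.0 client API; the separator character and field packing are made up for illustration:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PackedOrder {
        private static final char SEP = '\u0001'; // illustrative separator

        // Join the order's 10 field values into one delimited string.
        static String pack(String[] fields) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) sb.append(SEP);
                sb.append(fields[i]);
            }
            return sb.toString();
        }

        // One cell per order: qualifier = order ID, value = the packed fields.
        static Put orderPut(byte[] customerRowKey, byte[] orderId, String[] fields) {
            Put p = new Put(customerRowKey);
            p.add(Bytes.toBytes("f"), orderId, Bytes.toBytes(pack(fields)));
            return p;
        }
    }

The trade-off is that reading or updating a single field then means deserializing the whole order.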
>
>
>> Date: Wed, 22 Dec 2010 19:00:25 -0800
>> Subject: Re: Insert into tall table 50% faster than wide table
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>>
>>> Each column is the order so you write one column for each order
>> As stated earlier, wide table has 6,000 columns instead of 600. :-)
>>
>> Bryan:
>> Can you describe how you form row keys in each case ?
>>
>>
>> On Wed, Dec 22, 2010 at 6:53 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> HBase does version cells.
>>>
>>> But I saw something of interest:
>>> "
>>>>>> In my test, there are 10,000 customers, each customer has 600 orders
>>> and each order has 10 columns. The tall table approach results in 6 mil rows
>>> of 10 columns. The wide table approach results is 10,000 rows of 6,000
>>> columns. I'm using hbase 0.89-20100924 and hadoop 0.20.2. I am adding the
>>> orders using a Put for each order, submitted in batches of 1000 as a list of
>>> Puts.
>>>>>>
>>>>>> Are there techniques to speed up inserts with the wide table approach
>>> that I am perhaps overlooking?
>>>>>>
>>>>>
>>>> "
>>>
>>> Ok, so you have 10K by 600 by 10. So the 'tall' design has a row key of
>>> customer_id and order_id, with 10 columns in a single column family.
>>> So you get 6 million rows, with 10 column puts each.
>>>
>>> Now if you do a 'wide' table...
>>> Your row key is the 'customer_id' only. Each column is the order so you
>>> write one column for each order, and you have to figure out how to represent
>>> the order's fields within that column.
>>> (An example... your order's 10 fields are represented by a string with a
>>> 'special character' used as a field separator.)
>>> So you're doing one column write for each order, and you have a total of 10K
>>> rows.
>>>
>>> Unless I'm missing something, part of the 'slowness' could be how you're
>>> writing your orders in your wide table. There are a couple of other unknowns.
>>> Are you hashing your keys?
>>> I mean, are you getting a bit of 'randomness' in your keys?
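For what it's worth, a minimal sketch of one common way to get that randomness, assuming sequential keys were the problem: prefix the key with a byte of its own hash (the one-byte salt width here is arbitrary):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
        // Prefix a row key with one byte of its MD5 digest so that
        // otherwise-sequential keys scatter across regions.
        static byte[] salt(byte[] rowKey) {
            try {
                byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey);
                return Bytes.add(new byte[] { digest[0] }, rowKey);
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e); // MD5 is always available
            }
        }
    }

In this thread it likely doesn't apply, since Bryan's keys are already random UUIDs.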
>>>
>>> So what am I missing?
>>>
>>> -Mike
>>>
>>>
>>>> Subject: Re: Insert into tall table 50% faster than wide table
>>>> From: [EMAIL PROTECTED]
>>>> Date: Wed, 22 Dec 2010 18:24:05 -0800
>>>> To: [EMAIL PROTECTED]
>>>>
>>>> Actually I don't think this is the problem, as HBase versions cells, not
>>>> rows, if I understand correctly.
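To make the cells-not-rows point concrete, a small sketch (row, family, and qualifier names are illustrative): writes to different qualifiers create independent cells, and only repeated writes to the same row + family + qualifier stack up as versions of one cell:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellVersions {
        static List<Put> example() {
            byte[] row = Bytes.toBytes("customer-1"); // illustrative row key
            byte[] fam = Bytes.toBytes("f");          // illustrative family

            // Different qualifiers on the same row: two independent cells,
            // no per-row versioning overhead.
            Put a = new Put(row);
            a.add(fam, Bytes.toBytes("order1:qty"), Bytes.toBytes("5"));
            Put b = new Put(row);
            b.add(fam, Bytes.toBytes("order2:qty"), Bytes.toBytes("3"));

            // Same row + family + qualifier again: this becomes a newer
            // version of that one cell, bounded by the family's max versions.
            Put c = new Put(row);
            c.add(fam, Bytes.toBytes("order1:qty"), Bytes.toBytes("6"));

            return Arrays.asList(a, b, c);
        }
    }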
>>>>
>>>> On Dec 22, 2010, at 5:03 PM, Bryan Keller wrote:
>>>>
>>>>> Perhaps the slow wide-table insert performance is related to row
>>>>> versioning? If I have a customer row and keep adding order columns one by
>>>>> one, might a version of the row be kept for every order I add? If I simply
>>>>> insert a new row for every order, there is no versioning going on. Could
>>>>> this be causing the performance problems?
>>>>>
>>>>> On Dec 22, 2010, at 4:16 PM, Bryan Keller wrote:
>>>>>
>>>>>> It appears to be the same or better, not to derail my original