Re: HBase secondary index performance
Andrey Stepachev 2010-09-05, 18:24
2010/9/5 Murali Krishna. P <[EMAIL PROTECTED]>:
> Hi,
>        Thanks for the detailed explanation. I liked the idea of the timestamp
> check; this will be good enough for us, and I can run a periodic MR cleaner.
> However, I need some help in understanding the 30K number that was claimed.

The real insert rate will depend on the row size, the write buffer size, etc.
For a simple row with one long per row I got 30k requests/second
(as shown in the HBase UI).
For JSON-serialized objects of 100-700 bytes each, with validation, I can
insert 2-6k objects (JSON) per second.
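
For reference (not from the original thread), here is a minimal sketch of
batched puts with the client-side write buffer enabled, using the classic
HTable API; the table, family and qualifier names are made up:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BulkPutSketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");   // hypothetical table name
      table.setAutoFlush(false);                    // buffer puts on the client
      table.setWriteBufferSize(2 * 1024 * 1024);    // 2 MB write buffer
      for (long i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes(i));        // key: one long per row
        put.add(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(i));
        table.put(put);                             // buffered, sent in batches
      }
      table.flushCommits();                         // push remaining buffered puts
      table.close();
    }
  }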

> With the IndexedTable approach, I got only 1200 rows/s (60 rows/s x 20 index
> columns). I understand that there are additional reads that the IndexedTable
> does, but the 25X improvement that you got is very impressive. Can you please
> help me to understand this gain? (My hardware is 8GB RAM / 7.2k rpm disk /
> 2-core 2GHz)

Did you try to insert data into a non-indexed table (with the IndexedTable
extension disabled)? What numbers did you get?

>
>  Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrey Stepachev <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Sun, 5 September, 2010 3:53:26 AM
> Subject: Re: HBase secondary index performance
>
> 2010/9/3 Murali Krishna. P <[EMAIL PROTECTED]>:
>
>>        * Custom indexing is good, but our data keeps changing every day,
>> so probably IndexedTable is the best option for us.
>
> With custom indexing you can use timestamps to check that an index record
> is still valid (or even simply recheck the existence of the value).
> You also need regular index cleanup (an MR job or some custom application).
>
> To index a row identified by 'key' and holding 'value', we can create an
> index table whose row key is [value:key] and insert an index row every time
> we insert a value. We get about 30k rows/s/node.
> When we want to find all rows with 'value', we scan [value:0000, value:9999]
> and find all keys which point to rows containing that value.
> So we scan the index, random-get the rows, recheck that the index entry is
> still valid (check the value or the timestamp; the index timestamp should be
> >= the value timestamp) and return only valid values (maybe we can even
> delete on the fly when we get a negative result, to automatically clean up
> stale data).
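
A rough sketch of this write/scan/verify pattern with the classic HTable
client API (not part of the original message; the table handles, the "f:v"
column and the value:key layout are assumptions):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class IndexSketch {
    static final byte[] F = Bytes.toBytes("f");
    static final byte[] V = Bytes.toBytes("v");

    // On every write of (key, value) to the data table, also write the
    // [value:key] row into the index table.
    static void indexedPut(HTable data, HTable index, byte[] key, byte[] value)
        throws IOException {
      Put p = new Put(key);
      p.add(F, V, value);
      data.put(p);
      Put ip = new Put(Bytes.add(value, Bytes.toBytes(":"), key)); // value:key
      ip.add(F, V, new byte[0]);                                   // no payload needed
      index.put(ip);
    }

    // Find all data rows holding 'value': scan the index prefix, then verify
    // each hit against the data table and drop stale entries on the fly.
    static void lookup(HTable data, HTable index, byte[] value) throws IOException {
      byte[] prefix = Bytes.add(value, Bytes.toBytes(":"));
      Scan scan = new Scan(prefix, Bytes.add(value, Bytes.toBytes(";"))); // ';' sorts just after ':'
      ResultScanner rs = index.getScanner(scan);
      for (Result r : rs) {
        byte[] indexKey = r.getRow();
        byte[] dataKey = Bytes.tail(indexKey, indexKey.length - prefix.length);
        Result row = data.get(new Get(dataKey));
        // Recheck: the data row must still hold the value (or compare timestamps).
        if (row.isEmpty() || !Bytes.equals(row.getValue(F, V), value)) {
          index.delete(new Delete(indexKey));  // stale index entry: clean it up
        } else {
          System.out.println("hit: " + Bytes.toString(dataKey));
        }
      }
      rs.close();
    }
  }
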
>
>
>>        * Just added one more regionserver and it did not help. Actually it
>> went back to 60/s for some strange reason (with one client). The requests
>> in the HBase UI are not uniform across the 2 region servers. One server is
>> doing around 2000 and the other 500. Probably once the region gets split
>> and when we have lots of data, writes will improve? (Now it is just writing
>> to one region for the main table.)
>
> Looks like all the data goes to one region server. Try to make the writes
> more random (maybe use a random UUID as the key, or some other key
> randomization technique).
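
One common variant of this is a deterministic salt: prefixing the natural key
with a short hash spreads sequential keys across regions while staying
reproducible for reads. A small sketch (the 2-byte salt length is an
assumption):

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedKey {
    // Prefix the natural key with 2 bytes of its MD5 hash so that
    // consecutive keys land on different regions.
    static byte[] salted(byte[] naturalKey) throws NoSuchAlgorithmException {
      byte[] digest = MessageDigest.getInstance("MD5").digest(naturalKey);
      return Bytes.add(Bytes.head(digest, 2), naturalKey);
    }
  }
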
>
>>        * Is there some way to bulk load the IndexedTable? Earlier I used
>> the bulk loader tool (a MapReduce job which creates the regions offline),
>> but I am not sure whether it works with an indexed table.
>
> Not sure, but you can look at the source code and try to emulate the
> indexing operations in your own code after a regular bulk load.
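
One possible way to emulate the indexing after a plain bulk load is a map-only
job that scans the freshly loaded data table and writes the corresponding
[value:key] rows into the index table. A sketch only (table, family and
qualifier names are assumptions, and it is not tied to the IndexedTable
contrib):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.mapreduce.Job;

  public class BuildIndex {
    static class IndexMapper extends TableMapper<ImmutableBytesWritable, Put> {
      @Override
      protected void map(ImmutableBytesWritable key, Result row, Context ctx)
          throws IOException, InterruptedException {
        byte[] value = row.getValue(Bytes.toBytes("f"), Bytes.toBytes("v"));
        if (value == null) return;
        // index row key = value:key, pointing back at the data row
        Put p = new Put(Bytes.add(value, Bytes.toBytes(":"), key.get()));
        p.add(Bytes.toBytes("f"), Bytes.toBytes("v"), new byte[0]);
        ctx.write(new ImmutableBytesWritable(p.getRow()), p);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "build-index");
      job.setJarByClass(BuildIndex.class);
      TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
          IndexMapper.class, ImmutableBytesWritable.class, Put.class, job);
      TableMapReduceUtil.initTableReducerJob("myindex", null, job); // puts go to the index table
      job.setNumReduceTasks(0);                                     // map-only job
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
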
>
>>
>>
>>  Thanks,
>> Murali Krishna
>>
>>
>
> Andrey.
>