HBase, mail # user - Hbase scalability performance


Re: Hbase scalability performance
Michael Segel 2012-12-22, 16:23
I thought it was Doug Meil who said that HBase doesn't start to shine until you have at least 5 nodes.
(Apologies if I misspelled Doug's name.)

I happen to concur, and if you want to start testing scalability, you will want to build a bigger test rig.

Just saying!
Oh, and you're going to have a hot spot on that sequential row key.
Maybe use a hashed UUID?
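
To illustrate the hot-spot point, here is a minimal sketch in Python rather than the HBase client API. It assumes the key is built by prefixing the sequential id with part of its MD5 digest, which is one way to scatter adjacent ids across the key space. The function name `hashed_key` and the 8-character prefix are hypothetical choices, not something from the original test.

```python
import hashlib

def hashed_key(seq_id: int) -> bytes:
    """Build a row key whose sort position no longer follows the sequential id.

    Assumption: an 8-hex-character MD5 prefix is enough to spread the keys.
    """
    raw = str(seq_id).encode("ascii")
    prefix = hashlib.md5(raw).hexdigest()[:8]
    # Keep the original id as a suffix so the key stays unique and readable.
    return f"{prefix}-{seq_id}".encode("ascii")

# The same id always maps to the same key, so a reader can recompute it for a get().
keys = [hashed_key(i) for i in range(5)]
```

Because the prefix is derived from the id, random reads still work: the client recomputes the key before each fetch.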

I would suggest that you consider the following:

Create N rows, where N is very large.
Then, to generate your random access, do a full table scan to load the N row keys into memory.
Using a random number generator, pick one of the keys at random, fetch that row, and pop the key off the list so that the next iteration draws from the remaining (N-1) keys.
Do this 200K times.

Now time your 200K random fetches.
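
The procedure above can be sketched roughly as follows. This is Python with a plain dict standing in for the table, so `fetch` is a hypothetical placeholder for a real HBase get(), and the helper names are made up for illustration. The keys are chosen before the clock starts so that only the fetches are timed.

```python
import random
import time

def pick_without_replacement(keys, count, rng):
    """Yield `count` distinct keys; each draw removes the key, so the pool shrinks by one."""
    pool = list(keys)
    if count > len(pool):
        raise ValueError("cannot draw more keys than exist")
    for _ in range(count):
        i = rng.randrange(len(pool))
        # Swap the chosen key to the end and pop it: O(1) removal.
        pool[i], pool[-1] = pool[-1], pool[i]
        yield pool.pop()

def time_random_gets(fetch, keys, count, seed=0):
    """Return the seconds spent calling `fetch` on `count` randomly chosen keys."""
    rng = random.Random(seed)
    chosen = list(pick_without_replacement(keys, count, rng))
    start = time.perf_counter()
    for key in chosen:
        fetch(key)
    return time.perf_counter() - start

# Stand-in table: a dict lookup, not a real HBase get().
table = {f"row-{i}".encode("ascii"): i for i in range(1000)}
elapsed = time_random_gets(table.__getitem__, list(table), 200)
```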

It would be interesting to see how it performs, taking the average of a couple of runs, and then increasing the key space by an order of magnitude each time.
(Start with 1 million rows, then 10 million rows, then 100 million rows...)

In theory, if properly tuned, one should expect near-linear results. That is to say, the time it takes to get() a row should stay consistent as the data space grows. Although I wonder if you would have to somehow clear the cache?
Sorry, just a random thought...
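
One way to check the "consistent get() time" idea once the runs are done is to reduce each table size to an average per-get latency and compare it against the smallest table. The sketch below assumes every run at a given size issues the same number of fetches. The function names and the input shape (a dict from row count to a list of run times in seconds) are hypothetical.

```python
from statistics import mean

def per_get_latency_ms(run_seconds, fetches_per_run):
    """Average milliseconds per get() over several timed runs of equal size."""
    return mean(run_seconds) / fetches_per_run * 1000.0

def latency_ratios(runs_by_rows, fetches_per_run):
    """Map each table size to its per-get latency relative to the smallest table."""
    latencies = {
        rows: per_get_latency_ms(seconds, fetches_per_run)
        for rows, seconds in runs_by_rows.items()
    }
    baseline = latencies[min(latencies)]
    return {rows: latency / baseline for rows, latency in latencies.items()}
```

Ratios that stay close to 1.0 as the row count grows would match the expectation above; ratios that climb would suggest something else (caching, region balance) is in play.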

-Mike

On Dec 22, 2012, at 10:06 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> By '3 datanodes', did you mean that you also increased the number of region
> servers to 3 ?
>
> When your test was running, did you look at Web UI to see whether load was
> balanced ? You can also use Ganglia for such purpose.
>
> What version of HBase are you using ?
>
> Thanks
>
> On Sat, Dec 22, 2012 at 7:43 AM, Dalia Sobhy <[EMAIL PROTECTED]> wrote:
>
>> Dear all,
>>
>> I am testing a simple hbase application on a cluster of multiple nodes.
>>
>> I am especially testing the scalability performance, by measuring the time
>> taken for random reads
>>
>> Data size: 200,000 rows
>> Row key: 0, 1, 2, ... (a very simple incremental row key)
>>
>> But I don't know why I see the same time when I increase the cluster size.
>>
>> For ex:
>> 2 Datanodes: 1000 random read: 1.757 sec
>> 3 datanodes: 1000 random read: 1.7 sec
>>
>> So any help plzzz ??
>>
>>