Re: How to speedup Hbase query throughput
This is why I asked you earlier about how you were generating your user ids.

You're not going to get a good distribution.

First, random numbers usually aren't that random.

How many users do you want to simulate?
Try this...
Create n type 5 uuids. These are uuids that are generated by hashing a name with the SHA-1 hashing algo and then truncating the result to the right number of bits.

This will give you a more realistic random distribution of user ids. Note that you will have to remember the user ids! They will also be alphanumeric.
Then you can use your 'month' as part of your key. However... I have to question your design again. Billing by month means that you will only have 12 months of data, and the data generation really isn't random. Meaning you don't generate your data out of sequence.
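
The type 5 uuid suggestion above can be sketched in Python with the standard uuid module. This is only an illustration: the namespace name "billing-sim.example" and the "user-<i>" naming scheme are made up for the example. Because type 5 uuids are derived from a name, the same ids can be regenerated later instead of being stored.

```python
import uuid

# Hypothetical namespace for the simulation; any fixed UUID would do.
SIM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "billing-sim.example")

def make_user_ids(n):
    """Return n type 5 UUID strings, one per simulated user.

    The ids are deterministic: calling this again with the same n
    reproduces the same list, so they do not have to be saved.
    """
    return [str(uuid.uuid5(SIM_NAMESPACE, f"user-{i}")) for i in range(n)]

user_ids = make_user_ids(1000)
print(user_ids[0])
```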

Just a suggestion... It sounds like you're trying to simulate queries where users get created mid stream and don't always stick around. So when you create a user, you can also simulate his start/join date and his end date and then generate his 'billing' information. I would suggest that instead of using a random number for billing month that you actually create your own time stamp...
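
A rough Python sketch of that idea, under my own assumptions: each user gets a random join month and a random leave month inside a fixed window, and one bill date is produced per active month. The window dates, the uniform choice of join and leave months, and the function names are all illustrative.

```python
import random
from datetime import date

def month_range(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield date(y, m, 1)
        m += 1
        if m > 12:
            y, m = y + 1, 1

def simulate_user_bills(rng, window_start, window_end):
    """Pick a join and a leave month inside the window and return
    one bill date for every month the user was active."""
    months = list(month_range(window_start, window_end))
    join = rng.randrange(len(months))          # user may join mid-stream
    leave = rng.randrange(join, len(months))   # and may leave before the end
    return months[join:leave + 1]

rng = random.Random(42)  # fixed seed so the data set can be regenerated
bills = simulate_user_bills(rng, date(2010, 6, 1), date(2011, 5, 1))
print(len(bills), bills[0], bills[-1])
```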

I am also assuming that you are generating the data first and then running queries against a static data set?

If this is true, and you create both the uuids and then the billing data, you'll get a better random data set that is going to be more realistic...

Having said all of this...

You have a couple of options...

First you can make your key month+userid, assuming you only have 12 months of data.
Or you can make your key userid+month. This has the additional benefit of collocating your user's data.
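
To see the difference between the two key layouts, here is a small Python sketch. HBase stores rows sorted by row key in byte order, so plain string sorting is used as a stand-in. The short user ids and the two-digit month encoding are assumptions made for the example.

```python
def key_month_user(month, user_id):
    """Row key with the month first: rows for one month sit together."""
    return f"{month:02d}{user_id}"

def key_user_month(user_id, month):
    """Row key with the user first: rows for one user sit together."""
    return f"{user_id}{month:02d}"

users = ["a1f3", "c07b"]   # stand-ins for fixed-width user ids
months = [1, 2, 3]

month_first = sorted(key_month_user(m, u) for u in users for m in months)
user_first = sorted(key_user_month(u, m) for u in users for m in months)

print(month_first)  # the two users alternate within each month
print(user_first)   # each user's three months are adjacent
```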

Or you could choose a third option....
You are trying to retrieve a user's billing data. This could be an object. So you could store the bill as a column in a table where the column id is the timestamp of the bill.

If you want the last date first, you can do a simple trick... If you are using months, make the column id (99 - month) so that the columns sort in reverse order, latest month first.
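
A quick Python sketch of the reverse-order trick. Column qualifiers within a row are kept in sorted byte order, so a plain dict with sorted keys stands in for one user's row here. The two-digit zero-padded encoding and the bill strings are assumptions for illustration.

```python
def reverse_month_qualifier(month):
    """Return a two-digit column id that sorts later months first."""
    return f"{99 - month:02d}"

# One row per user; each bill is stored under a column id derived from its month.
row = {reverse_month_qualifier(m): f"bill for month {m}" for m in range(1, 13)}

# Reading the columns in sorted order returns the latest bills first.
for qualifier in sorted(row)[:3]:
    print(qualifier, row[qualifier])
```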

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 19, 2011, at 7:08 PM, Weihua JIANG <[EMAIL PROTECTED]> wrote:

> Sorry for missing the background.
> We assume a user is more interested in his latest bills than his old
> bills. Thus, the query generator works as below:
> 1. randomly generate a number and reverse it as user id.
> 2. randomly generate a prioritized month based on the above assumption.
> 3. ask HBase to query this user + month.
> Thanks
> Weihua
> 2011/5/20 Matt Corgan <[EMAIL PROTECTED]>:
>> I think I traced this to a bug in my compaction scheduler that would have
>> missed scheduling about half the regions, hence the 240gb vs 480gb.  To
>> confirm: major compaction will always run when asked, even if the region is
>> already major compacted, the table settings haven't changed, and it was last
>> major compacted on that same server.  [potential hbase optimization here for
>> clusters with many cold regions].  So my theory about not localizing blocks
>> is false.
>> Weihua - why do you think your throughput doubled when you went from
>> user+month to month+user keys?  Are your queries using an even distribution
>> of months?  I'm not exactly clear on your schema or query pattern.
>> On Thu, May 19, 2011 at 8:39 AM, Joey Echeverria <[EMAIL PROTECTED]> wrote:
>>> I'm surprised the major compactions didn't balance the cluster better.
>>> I wonder if you've stumbled upon a bug in HBase that's causing it to
>>> leak old HFiles.
>>> Is the total amount of data in HDFS what you expect?
>>> -Joey
>>> On Thu, May 19, 2011 at 8:35 AM, Matt Corgan <[EMAIL PROTECTED]> wrote:
>>>> that's right
>>>> On Thu, May 19, 2011 at 8:23 AM, Joey Echeverria <[EMAIL PROTECTED]>
>>> wrote:
>>>>> Am I right to assume that all of your data is in HBase, i.e. you don't
>>>>> keep anything in just HDFS files?