Re: Is there a problem with having 4000 tables in a cluster?
Since different people use different terms... Salting is BAD. (You need to understand what is implied by the term salt.)

What you really want to do is take the hash of the key, and then truncate the hash. Use that instead of a salt.

Much better than a salt.
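
A minimal sketch of the truncated-hash idea, assuming the truncated hash is prepended to the original key (the "KeyHash + Key" layout from the quoted message below). MD5, the 2-byte prefix width, and the function names are illustrative assumptions, not something specified in this thread. Because the prefix is computed from the key itself, any client can rebuild the full row key for a point lookup, which a random salt would not allow.

```python
import hashlib

# Hypothetical prefix width; 2 bytes gives 65536 possible prefixes.
PREFIX_BYTES = 2


def prefixed_key(key: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    """Return the row key: truncated hash of the key, followed by the key."""
    digest = hashlib.md5(key).digest()      # deterministic for a given key
    return digest[:prefix_bytes] + key      # truncate the hash, then prepend it


def original_key(row_key: bytes, prefix_bytes: int = PREFIX_BYTES) -> bytes:
    """Strip the fixed-width prefix to recover the original key (e.g. in a reducer)."""
    return row_key[prefix_bytes:]


if __name__ == "__main__":
    key = b"1_exampleword"                  # made-up key in the [POSITION]_[WORD] shape
    row = prefixed_key(key)
    print(row.hex(), original_key(row))
```

The trade-off is that rows are no longer stored in original-key order, so a range scan over consecutive original keys no longer reads contiguous rows.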

Sent from a remote device. Please excuse any typos...

Mike Segel

> On Sep 24, 2013, at 5:17 PM, "Varun Sharma" <[EMAIL PROTECTED]> wrote:
> It's better to do some "salting" in your keys for the reduce phase.
> Basically, make your key something like "KeyHash + Key", then decode it
> in your reducer and write to HBase. This way you avoid the hotspotting
> problem on HBase caused by MapReduce sorting.
> On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <
>> Hi Jeremy,
>> I don't see any issue for HBase to handle 4000 tables. However, I don't
>> think it's the best solution for your use case.
>> JM
>> 2013/9/24 jeremy p <[EMAIL PROTECTED]>
>>> Short description : I'd like to have 4000 tables in my HBase cluster. Will
>>> this be a problem?  In general, what problems do you run into when you try
>>> to host thousands of tables in a cluster?
>>> Long description : I'd like the performance advantage of pre-split tables,
>>> and I'd also like to do filtered range scans.  Imagine a keyspace where the
>>> key consists of : [POSITION]_[WORD] , where POSITION is a number from 1 to
>>> 4000, and WORD is a string consisting of 96 characters.  The value in the
>>> cell would be a single integer.  My app will examine a 'document', where
>>> each 'line' consists of 4000 WORDs.  For each WORD, it'll do a filtered
>>> regex lookup.  Only problem?  Say I have 200 mappers and they all start at
>>> POSITION 1, my region servers would get hotspotted like crazy. So my idea
>>> is to break it into 4000 tables (one for each POSITION), and then pre-split
>>> the tables such that each region gets an equal amount of the traffic.  In
>>> this scenario, the key would just be WORD.  Dunno if this is a bad idea, would
>>> be open to suggestions.
>>> Thanks!
>>> --J
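
The pre-splitting mentioned in the original question can be combined with a fixed-width hash prefix on the row key. The sketch below is one way to compute evenly spaced split points, assuming a uniformly distributed prefix of `prefix_bytes` bytes at the front of each key. The function name and the region count are hypothetical, and the resulting byte strings would still have to be passed to whatever table-creation call the cluster uses.

```python
def split_points(num_regions: int, prefix_bytes: int = 2) -> list[bytes]:
    """Return num_regions - 1 evenly spaced split keys over the prefix space."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    space = 256 ** prefix_bytes             # number of distinct prefix values
    if num_regions > space:
        raise ValueError("more regions than distinct prefixes")
    return [
        (i * space // num_regions).to_bytes(prefix_bytes, "big")
        for i in range(1, num_regions)      # the first region has no lower bound
    ]


if __name__ == "__main__":
    # Made-up region count, chosen only for illustration.
    for point in split_points(8, prefix_bytes=1):
        print(point.hex())
```

If the hash covers the whole [POSITION]_[WORD] key, rows for a single POSITION are scattered across all regions. That spreads the load, but a scan restricted to one POSITION is no longer a contiguous range.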