jeremy p 2013-09-24, 21:34
Jean-Marc Spaggiari 2013-09-24, 21:50
Varun Sharma 2013-09-24, 22:16
Michael Segel 2013-09-25, 00:57
Michael Webster 2013-09-25, 05:16
jeremy p 2013-09-26, 19:25
-Re: Is there a problem with having 4000 tables in a cluster?
Varun Sharma 2013-09-25, 04:55
Okay, thanks for the explanation. You can hash or salt (as many people say)
the keys to avoid the hotspotting problem. What this means is that you
push the part that issues filtered range queries to HBase into the reduce
phase.
The idea is this:
1) You get your query '<Pos>_<WORD>' in the mapper, build the key
"Hash[<Pos>_<WORD>]_<Pos>_<WORD>", and pass it through to the reduce phase.
2) Your reduce phase ends up sorting the keys by the "Hash" and if you
choose a nice random hash, you get this really nice distribution in your
reduce phase. At this point, the reducers will encounter keys from all
kinds of positions - you just need to strip out the Hash. And then you
issue reads/your range queries against HBase in the reduce phase.
This obviously means that you will spend some extra resources in temporary
storage during the shuffle phase etc. but it should help avoid your
hotspotting problem. Unless something prevents you from adding a reduce
phase, this is probably a better idea than creating 4K tables.
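The two steps above can be sketched in a few lines of Python. This is only
an illustration of the key handling, not MapReduce or HBase client code: the
choice of MD5, the 8-character prefix length, and the helper names
(hash_prefix, to_shuffle_key, from_shuffle_key) are all assumptions made for
the example. The prefix is a truncated hash of the key itself, so it is
deterministic and can always be recomputed from the key.

```python
import hashlib

# Assumed prefix length; any short, fixed-length slice of the digest works.
HASH_LEN = 8


def hash_prefix(key: str) -> str:
    """Return a truncated, deterministic hash of the key (MD5 is an assumption)."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:HASH_LEN]


def to_shuffle_key(pos: int, word: str) -> str:
    """Mapper side: build 'Hash[<Pos>_<WORD>]_<Pos>_<WORD>'."""
    query = f"{pos}_{word}"
    return f"{hash_prefix(query)}_{query}"


def from_shuffle_key(shuffle_key: str) -> str:
    """Reducer side: strip the hash prefix to recover '<Pos>_<WORD>'.

    The hex prefix never contains '_', so splitting once is safe even if
    WORD itself contains underscores.
    """
    _prefix, query = shuffle_key.split("_", 1)
    return query


# Hypothetical input: many queries that all start at POSITION 1.
queries = [(1, "alpha"), (1, "beta"), (1, "gamma"), (2, "alpha"), (3, "beta")]

# The shuffle sorts by the full key, i.e. by the hash prefix first, so
# neighbouring keys no longer share a POSITION.
shuffled = sorted(to_shuffle_key(pos, word) for pos, word in queries)

# In the reducer, recover the original query and issue the HBase read with it.
recovered = [from_shuffle_key(key) for key in shuffled]
```

The reducer only ever sees the original '<Pos>_<WORD>' after stripping the
prefix, so the HBase table layout does not need to change for this to work.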
On Tue, Sep 24, 2013 at 5:57 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Since different people use different terms... Salting is BAD. (You need to
> understand what is implied by the term salt.)
> What you really want to do is take the hash of the key, and then truncate
> the hash. Use that instead of a salt.
> Much better than a salt.
> Sent from a remote device. Please excuse any typos...
> Mike Segel
> > On Sep 24, 2013, at 5:17 PM, "Varun Sharma" <[EMAIL PROTECTED]> wrote:
> > It's better to do some "salting" in your keys for the reduce phase.
> > Basically, make your key something like "KeyHash + Key" and then decode
> > it in your reducer and write to HBase. This way you avoid the hotspotting
> > problem on HBase due to MapReduce sorting.
> > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <
> > [EMAIL PROTECTED]> wrote:
> >> Hi Jeremy,
> >> I don't see any issue for HBase to handle 4000 tables. However, I don't
> >> think it's the best solution for your use case.
> >> JM
> >> 2013/9/24 jeremy p <[EMAIL PROTECTED]>
> >>> Short description: I'd like to have 4000 tables in my HBase cluster.
> >>> Will this be a problem? In general, what problems do you run into when
> >>> you try to host thousands of tables in a cluster?
> >>> Long description: I'd like the performance advantage of pre-split
> >>> tables, and I'd also like to do filtered range scans. Imagine a keyspace
> >>> where the key consists of [POSITION]_[WORD], where POSITION is a number
> >>> from 1 to 4000, and WORD is a string consisting of 96 characters. The
> >>> value in each cell would be a single integer. My app will examine a
> >>> 'document', where each 'line' consists of 4000 WORDs. For each WORD,
> >>> it'll do a filtered regex lookup. Only problem? Say I have 200 mappers
> >>> and they all start at POSITION 1, my region servers would get hotspotted
> >>> like crazy. So my idea is to break it into 4000 tables (one for each
> >>> POSITION), and then pre-split the tables such that each region gets an
> >>> equal amount of the traffic. In this scenario, the key would just be
> >>> WORD. Dunno if this is a bad idea, would be open to suggestions.
> >>> Thanks!
> >>> --J
jeremy p 2013-09-24, 22:52
Jean-Marc Spaggiari 2013-09-24, 23:16
Varun Sharma 2013-09-24, 23:22
Jean-Marc Spaggiari 2013-09-24, 23:32
jeremy p 2013-09-25, 00:11