jeremy p 2013-09-24, 21:34
Jean-Marc Spaggiari 2013-09-24, 21:50
Varun Sharma 2013-09-24, 22:16
Michael Segel 2013-09-25, 00:57
The biggest issue I see with so many tables is the region counts could get
quite large. With 4000 tables, you will need at least that many regions,
not even accounting for splitting the regions/growth.
Forgive the speculation, but it almost sounds like you want an inverted
index. Could you not just store the word as the rowkey and have each
position be a column? 4000 columns isn't really that many, and if this is
text it will probably be pretty sparse, i.e., most words won't appear in
all 4K positions. When you set up the scanner for your MR job, add only
the column for the position that you want.
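
To make the layout concrete, here is a minimal sketch that models it in plain Python rather than against a real cluster. The dict-based `table`, `put`, and `scan_position` are hypothetical stand-ins for an HBase table, a put, and a scan restricted to one column; none of them are HBase API calls.

```python
# Hypothetical in-memory stand-in for one HBase table: rowkey -> {column: value}.
# The rowkey is the WORD; each POSITION (1..4000) is a column qualifier.
table = {}


def put(word, position, value):
    """Store one integer cell under row `word`, column `position`."""
    table.setdefault(word, {})[position] = value


def scan_position(position, start_word="", stop_word=None):
    """Yield (word, value) in rowkey order for rows that have this column.

    Mimics a range scan [start_word, stop_word) that requests a single
    column, so rows without that column produce no result.
    """
    for word in sorted(table):
        if word < start_word:
            continue
        if stop_word is not None and word >= stop_word:
            break
        cells = table[word]
        if position in cells:
            yield word, cells[position]


# Sparse data: most words appear in only a few positions.
put("apple", 1, 7)
put("apple", 3, 2)
put("banana", 3, 5)
put("cherry", 2, 9)

hits = list(scan_position(3))
print(hits)
```

Because the rows stay sorted by word, a mapper can still be handed a contiguous key range, and the position only narrows which cells come back.
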
The downside of pushing the reads into the reduce phase by prepending a
hash/salt to each real rowkey seems to be that, assuming the hashes are
well distributed, each reducer gets a random slice of the data and has to
issue individual gets for its rowkeys instead of taking advantage of
HBase's scan performance.
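
A small sketch of why that happens, again in plain Python with no HBase involved. The bucket count, the `NN_` prefix format, and the choice of MD5 are assumptions made for illustration; the prefix is a truncated hash of the key itself, as suggested in the quoted message below, so it can be recomputed for a get.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of prefix buckets; not from the thread


def salted_key(rowkey: str) -> str:
    """Prefix the key with a truncated hash of itself (hypothetical scheme)."""
    digest = hashlib.md5(rowkey.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % NUM_BUCKETS
    return f"{bucket:02d}_{rowkey}"


def original_key(salted: str) -> str:
    """Strip the prefix again, as a reducer would before using the real key."""
    return salted.split("_", 1)[1]


# Lexicographically contiguous keys, as a range scan would see them.
words = [f"word{i:04d}" for i in range(1000)]
salted = sorted(salted_key(w) for w in words)

# After prefixing, the contiguous range is spread across many buckets.
buckets_used = {k.split("_", 1)[0] for k in salted}
first_prefix = salted[0].split("_", 1)[0]
first_bucket = [original_key(k) for k in salted if k.startswith(first_prefix + "_")]

print(len(buckets_used), "buckets used")
print(first_bucket[:5])
```

The keys that land in any one bucket are an arbitrary subset of the original range, which is why a reader of a single word range ends up doing point lookups across buckets rather than one sequential scan.
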
On Tue, Sep 24, 2013 at 8:57 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Since different people use different terms... Salting is BAD. (You need to
> understand what is implied by the term salt.)
> What you really want to do is take the hash of the key, and then truncate
> the hash. Use that instead of a salt.
> Much better than a salt.
> Sent from a remote device. Please excuse any typos...
> Mike Segel
> > On Sep 24, 2013, at 5:17 PM, "Varun Sharma" <[EMAIL PROTECTED]> wrote:
> > It's better to do some "salting" in your keys for the reduce phase.
> > Basically, make your key be something like "KeyHash + Key" and then
> > decode it in your reducer and write to HBase. This way you avoid the
> > hotspotting problem on HBase due to MapReduce sorting.
> > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <
> > [EMAIL PROTECTED]> wrote:
> >> Hi Jeremy,
> >> I don't see any issue for HBase to handle 4000 tables. However, I don't
> >> think it's the best solution for your use case.
> >> JM
> >> 2013/9/24 jeremy p <[EMAIL PROTECTED]>
> >>> Short description : I'd like to have 4000 tables in my HBase cluster.
> >>> Will this be a problem? In general, what problems do you run into when
> >>> you try to host thousands of tables in a cluster?
> >>> Long description : I'd like the performance advantage of pre-split
> >>> tables, and I'd also like to do filtered range scans. Imagine a keyspace
> >>> where the key consists of : [POSITION]_[WORD] , where POSITION is a
> >>> number from 1 to 4000, and WORD is a string consisting of 96 characters.
> >>> The value in each cell would be a single integer. My app will examine a
> >>> 'document', where each 'line' consists of 4000 WORDs. For each WORD,
> >>> it'll do a filtered regex lookup. Only problem? Say I have 200 mappers
> >>> and they all start at POSITION 1, my region servers would get hotspotted
> >>> like crazy. So my idea is to break it into 4000 tables (one for each
> >>> POSITION), and then pre-split the tables such that each region gets an
> >>> equal amount of the traffic. In this scenario, the key would just be
> >>> WORD. Dunno if this is a bad idea, would be open to suggestions.
> >>> Thanks!
> >>> --J
*Michael Webster*, Software Engineer
jeremy p 2013-09-26, 19:25
Varun Sharma 2013-09-25, 04:55
jeremy p 2013-09-24, 22:52
Jean-Marc Spaggiari 2013-09-24, 23:16
Varun Sharma 2013-09-24, 23:22
Jean-Marc Spaggiari 2013-09-24, 23:32
jeremy p 2013-09-25, 00:11