Re: Is there a problem with having 4000 tables in a cluster?
It's better to do some "salting" in your keys for the reduce phase.
Basically, make your key something like "KeyHash + Key" and then decode it
in your reducer before writing to HBase. This way you avoid the hotspotting
problem on HBase caused by MapReduce's sorted output.
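
Roughly like this (untested sketch; the bucket count, the "f" column
family, and the word-count value are placeholders I made up for
illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SaltedWordJob {

  // How many salt buckets to spread keys over. Zero-padded below so
  // lexicographic order matches bucket order.
  static final int NUM_BUCKETS = 32;

  public static class SaltingMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Object key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String word = line.toString().trim();
      // Prefix the key with a bounded hash so the shuffle's sorted
      // output no longer walks HBase row keys in order.
      int salt = (word.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
      ctx.write(new Text(String.format("%02d_%s", salt, word)), ONE);
    }
  }

  public static class DecodingReducer
      extends Reducer<Text, IntWritable, ImmutableBytesWritable, Put> {
    @Override
    protected void reduce(Text salted, Iterable<IntWritable> values,
        Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      // Strip the salt back off: the real HBase row key is the bare word.
      String s = salted.toString();
      byte[] row = Bytes.toBytes(s.substring(s.indexOf('_') + 1));
      Put put = new Put(row);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(sum));
      // Pair with TableOutputFormat in the job driver to write to HBase.
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }
}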
On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Hi Jeremy,
>
> I don't see any issue with HBase handling 4000 tables. However, I don't
> think it's the best solution for your use case.
>
> JM
>
>
> 2013/9/24 jeremy p <[EMAIL PROTECTED]>
>
> > Short description : I'd like to have 4000 tables in my HBase cluster.
> > Will this be a problem?  In general, what problems do you run into
> > when you try to host thousands of tables in a cluster?
> >
> > Long description : I'd like the performance advantage of pre-split
> > tables, and I'd also like to do filtered range scans.  Imagine a
> > keyspace where the key consists of [POSITION]_[WORD], where POSITION
> > is a number from 1 to 4000, and WORD is a string consisting of 96
> > characters.  The value in the cell would be a single integer.  My app
> > will examine a 'document', where each 'line' consists of 4000 WORDs.
> > For each WORD, it'll do a filtered regex lookup.  Only problem?  Say I
> > have 200 mappers and they all start at POSITION 1; my region servers
> > would get hotspotted like crazy.  So my idea is to break it into 4000
> > tables (one for each POSITION), and then pre-split the tables such
> > that each region gets an equal amount of the traffic.  In this
> > scenario, the key would just be WORD.  Dunno if this is a bad idea;
> > I'd be open to suggestions.
> >
> > Thanks!
> >
> > --J
> >
>
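
On the pre-split side: rather than 4000 tables, you can get the same
traffic spreading by pre-splitting a single salted table so each bucket
lands in its own region. A rough, untested sketch against the 0.94-era
client API (the "words" table name and "f" family are made up to match
the example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateSaltedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("words");
    desc.addFamily(new HColumnDescriptor("f"));

    // One region per salt bucket: split boundaries at each zero-padded
    // prefix, matching the salts the MR job writes ("00_" .. "31_").
    int numBuckets = 32;
    byte[][] splits = new byte[numBuckets - 1][];
    for (int i = 1; i < numBuckets; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%02d_", i));
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}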