HBase >> mail # user >> Is there a problem with having 4000 tables in a cluster?


jeremy p 2013-09-24, 21:34
Jean-Marc Spaggiari 2013-09-24, 21:50
Varun Sharma 2013-09-24, 22:16
Michael Segel 2013-09-25, 00:57
Michael Webster 2013-09-25, 05:16
jeremy p 2013-09-26, 19:25
Varun Sharma 2013-09-25, 04:55
jeremy p 2013-09-24, 22:52
Jean-Marc Spaggiari 2013-09-24, 23:16
Re: Is there a problem with having 4000 tables in a cluster?
So you should salt the keys in the reduce phase, but you do not salt the keys
in HBase. That basically means that reducers do not see the keys in sorted
order, but they do see all the values for a specific key together.

So the hash is essentially a trick that stays within the MapReduce job and does
not make it into HBase. This prevents you from hotspotting a region, and since
the hash is not being written into HBase, you can still do your prefix/regex
scans etc.
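Varun's trick can be sketched in plain Java (the class and method names below are illustrative, not from any HBase or Hadoop API; in a real job the salting would happen in the Mapper and the stripping in the Reducer, just before building the Put):

```java
// Sketch of salting that lives only in the MapReduce shuffle: the mapper
// prepends a hash-derived bucket id to spread keys across reducers, and
// the reducer strips it again before writing the original key to HBase.
public class ShuffleSalt {

    // Prepend a fixed-width bucket id derived from the key's hash.
    // Emitted by the mapper so the shuffle distributes load evenly.
    public static String salt(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return String.format("%03d_%s", bucket, key);
    }

    // Recover the original key in the reducer, before the HBase Put,
    // so the salt never reaches the table and prefix scans still work.
    public static String unsalt(String saltedKey) {
        return saltedKey.substring(saltedKey.indexOf('_') + 1);
    }

    public static void main(String[] args) {
        String rowKey = "0001_somelongword";
        String salted = salt(rowKey, 128);
        // The round trip restores the HBase row key unchanged.
        System.out.println(unsalt(salted).equals(rowKey)); // true
    }
}
```

Because the salt is stripped before the write, the rows land in HBase in their natural `POSITION_WORD` order, which is what keeps the prefix/regex scans workable.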
On Tue, Sep 24, 2013 at 4:16 PM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:

> If you have a fixed length like:
> XXXX_AAAAAAA
>
> Where XXXX is a number from 0000 to 4000 and AAAAAAA is your word, then
> simply split by the number?
>
> Then when you insert each line, it will write to 4000 different
> regions, which can be hosted on 4000 different servers if you have that
> many. And there will be no hot-spotting?
>
> Then when you run MR job, you will have one mapper per region. Each region
> will be evenly loaded. And if you start 200 jobs, then each region will
> have 200 mappers, still evenly loaded, no?
>
> Or I might be missing something from your usecase.
>
> JM
>
>
> 2013/9/24 jeremy p <[EMAIL PROTECTED]>
>
> > Varun : I'm familiar with that method of salting.  However, in this
> > case, I need to do filtered range scans.  When I do a lookup for a
> > given WORD at a given POSITION, I'll actually be doing a regex on a
> > range of WORDs at that POSITION.  If I salt the keys with a hash, the
> > WORDs will no longer be sorted, and so I would need to do a full table
> > scan for every lookup.
> >
> > Jean-Marc : What problems do you see with my solution?  Do you have a
> > better suggestion?
> >
> > --Jeremy
> >
> >
> > On Tue, Sep 24, 2013 at 3:16 PM, Varun Sharma <[EMAIL PROTECTED]>
> wrote:
> >
> > > It's better to do some "salting" in your keys for the reduce phase.
> > > Basically, make your key be something like "KeyHash + Key" and then
> > > decode it in your reducer and write to HBase. This way you avoid the
> > > hotspotting problem on HBase due to MapReduce sorting.
> > >
> > >
> > > On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Jeremy,
> > > >
> > > > I don't see any issue for HBase to handle 4000 tables. However, I
> > > > don't think it's the best solution for your use case.
> > > >
> > > > JM
> > > >
> > > >
> > > > 2013/9/24 jeremy p <[EMAIL PROTECTED]>
> > > >
> > > > > Short description : I'd like to have 4000 tables in my HBase
> > > > > cluster.  Will this be a problem?  In general, what problems do
> > > > > you run into when you try to host thousands of tables in a
> > > > > cluster?
> > > > >
> > > > > Long description : I'd like the performance advantage of
> > > > > pre-split tables, and I'd also like to do filtered range scans.
> > > > > Imagine a keyspace where the key consists of : [POSITION]_[WORD],
> > > > > where POSITION is a number from 1 to 4000, and WORD is a string
> > > > > consisting of 96 characters.  The value in the cell would be a
> > > > > single integer.  My app will examine a 'document', where each
> > > > > 'line' consists of 4000 WORDs.  For each WORD, it'll do a
> > > > > filtered regex lookup.  The only problem?  Say I have 200 mappers
> > > > > and they all start at POSITION 1; my region servers would get
> > > > > hotspotted like crazy.  So my idea is to break it into 4000
> > > > > tables (one for each POSITION), and then pre-split the tables
> > > > > such that each region gets an equal amount of the traffic.  In
> > > > > this scenario, the key would just be WORD.  Dunno if this is a
> > > > > bad idea, would be open to suggestions.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --J
> > > >
> > >
> >
>
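JM's fixed-length-prefix layout could be sketched as generating the split points for a single pre-split table (the step size and helper names below are my own assumptions, not from the thread; with the HBase client, these strings would be converted to a byte[][] and passed as the splitKeys argument when creating the table):

```java
// Sketch of pre-splitting a single table on the fixed-width POSITION
// prefix JM describes: one split point per prefix boundary, so writes
// for different positions land in different regions from the start.
public class PrefixSplits {

    // Build interior split points "0100", "0200", ... for positions
    // 0..maxPosition, giving one region per `step` positions. With the
    // HBase client these strings would become the byte[][] splitKeys
    // handed to the admin API's createTable call.
    public static String[] splitPoints(int maxPosition, int step) {
        int n = maxPosition / step - 1;  // interior boundaries only
        String[] splits = new String[n];
        for (int i = 0; i < n; i++) {
            splits[i] = String.format("%04d", (i + 1) * step);
        }
        return splits;
    }

    public static void main(String[] args) {
        // 4000 positions, 100 positions per region -> 39 interior split
        // points, i.e. 40 regions.
        String[] splits = splitPoints(4000, 100);
        System.out.println(splits.length); // 39
        System.out.println(splits[0]);     // 0100
    }
}
```

Because the POSITION prefix is fixed-width, every key sorts under exactly one of these prefix ranges, which is what lets a single pre-split table stand in for the 4000 separate tables in the original question.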
Jean-Marc Spaggiari 2013-09-24, 23:32
jeremy p 2013-09-25, 00:11