HBase user mailing list: Is it necessary to set MD5 on rowkey?


bigdata 2012-12-18, 09:20
Doug Meil 2012-12-18, 13:40
Damien Hardy 2012-12-18, 09:33
Michael Segel 2012-12-18, 13:52
bigdata 2012-12-18, 15:20
Alex Baranau 2012-12-18, 17:12
Michael Segel 2012-12-18, 17:24
Alex Baranau 2012-12-18, 17:36
Michael Segel 2012-12-18, 23:29
lars hofhansl 2012-12-19, 18:37
Michael Segel 2012-12-19, 19:46
lars hofhansl 2012-12-19, 20:51
Michael Segel 2012-12-19, 21:02
David Arthur 2012-12-19, 21:26
Nick Dimiduk 2012-12-19, 22:15
Re: Is it necessary to set MD5 on rowkey?
I generally agree. I once built a webtable design in which we dropped the
scheme and reversed the domain to support "suffix glob" queries over a group
of related hosts. That leaves a natural hotspot at "com", but salting
would only have dispersed queries that should go to one row (or a group of
adjacent rows) across multiple regionservers, which hurts query
efficiency. Instead, we set the region split threshold low at the start,
on the assumption that the splits produced by the initial stream of URLs
would approximate the overall key distribution, and then raised the split
threshold once we reached production steady state.
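
A minimal sketch of the key layout described above, assuming Python's standard urllib.parse; the function name webtable_row_key and the example URLs are hypothetical, and this is not the code from the original design. It drops the scheme (and, as a simplification, any port) and reverses the host labels so that related hosts sort next to each other.

```python
from urllib.parse import urlsplit


def webtable_row_key(url: str) -> str:
    """Drop the scheme and reverse the host labels: www.example.com -> com.example.www."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lowercased and excludes any port
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path


# Hypothetical URLs, used only to show the resulting sort order.
urls = [
    "http://www.example.com/index.html",
    "https://mail.example.com/inbox",
    "http://example.org/about",
]
for key in sorted(webtable_row_key(u) for u in urls):
    print(key)
# com.example.mail/inbox
# com.example.www/index.html
# org.example/about
```

With keys in this shape, every host under example.com shares the prefix "com.example.", so a single prefix scan covers the whole group.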
On Wed, Dec 19, 2012 at 2:15 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> On Wed, Dec 19, 2012 at 1:26 PM, David Arthur <[EMAIL PROTECTED]> wrote:
>
> > Let's say you want to decompose a url into domain and path to include in
> > your row key.
> >
> > You could of course just use the url as the key, but you will see
> > hotspotting since most will start with "http".
>
>
> Doesn't the original Bigtable paper [0] design around this problem by
> dropping the protocol and only storing the domain? *goes to check* Yes, it
> does.
>
> Personally, I've never encountered an HBase schema design problem where
> salting really nailed it. It's an okay place to start with initial designs,
> especially if you don't know your data well. I'm a big fan of using the
> natural variance in the data itself to solve this problem. OpenTSDB does
> this quite well, IMHO. Plus, it's kind of a game or data puzzle -- how to
> use the data's nature to your advantage :)
>
> Just my 2¢
> -n
>
> [0]: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
>
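
For contrast, a rough sketch of one way an MD5-derived salt could be applied, and why it works against range queries. The bucket count, the "NN|" prefix format, and both function names are assumptions made for illustration; nothing here is an HBase API.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count


def salted_key(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a bucket number derived from its MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}|{key}"


def prefixes_to_scan(prefix: str, buckets: int = NUM_BUCKETS) -> list:
    """A prefix query over salted keys has to be issued once per bucket."""
    return [f"{b:02d}|{prefix}" for b in range(buckets)]


# Hypothetical keys for related hosts; unsalted, they would sort next to each other.
hosts = ["com.example.mail/", "com.example.news/", "com.example.www/"]
for h in hosts:
    print(salted_key(h))

print(len(prefixes_to_scan("com.example.")), "scans for one host group")
```

Because the bucket depends on the full key, related hosts are generally spread across buckets, so a query that was one contiguous scan becomes one scan per bucket.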

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
David Arthur 2012-12-19, 23:04
Alex Baranau 2012-12-19, 23:07
Michael Segel 2012-12-20, 01:09
Michael Segel 2012-12-20, 01:02
Jean-Marc Spaggiari 2012-12-20, 01:11
Michael Segel 2012-12-20, 01:23
Jean-Marc Spaggiari 2012-12-20, 01:35
Michel Segel 2012-12-20, 01:47
lars hofhansl 2012-12-20, 02:06
Michael Segel 2012-12-20, 13:20
Nick Dimiduk 2012-12-20, 18:15
Michael Segel 2012-12-20, 20:15
k8 robot 2013-02-06, 01:46