HBase >> mail # user >> Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
I think there's some hair-splitting going on here. The term "salting," by
strict definition [0] from the cryptographic context, means the
introduction of randomness to produce a one-way encoding of a value. The
technique for rowkey design described here does not include the
introduction of said randomness, nor is it a one-way operation. Instead, it
divides the input more or less evenly across a fixed number of buckets.

If you've introduced one-way randomness, you have no way to reproduce the
rowkey after it's been inserted. Thus, accessing a specific row would be
O(n) where n is the number of rows in the table -- you'd have to scan the
table looking for the desired row. A bucketing technique, on the other
hand, adds a small, constant amount of computation to calculate the true
rowkey, so access to a single row is O(1) + C regardless of the number of
rows in the table.
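
To make the bucketing idea concrete, here is a minimal Python sketch (plain byte manipulation, not HBase client code). The bucket count of 16, the choice of the first MD5 byte, and the function names are assumptions made for illustration; nothing in this thread prescribes them.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count, chosen only for illustration


def bucketed_rowkey(key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the key with a bucket byte derived from the key itself."""
    digest = hashlib.md5(key).digest()
    bucket = digest[0] % num_buckets  # deterministic: same key, same bucket
    return bytes([bucket]) + key


def original_key(rowkey: bytes) -> bytes:
    """Strip the one-byte bucket prefix to recover the original key."""
    return rowkey[1:]
```

Because the prefix is recomputed from the key, a reader can rebuild the full rowkey and issue a single get(), and the prefix can be stripped again when the original key is needed.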

-n

[0]: http://en.wikipedia.org/wiki/Salt_(cryptography)

On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Lars,
>
> Ok... he's talking about buckets.
>
> So when you have N buckets, what is the least number of get()s you need
> to fetch the single row?
> (Hint: The answer is N)
>
> How many scans? (N again)
>
> Do you disagree?
>
>
> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Mike, please think about what you write before you write it.
> > You will most definitely not need a full table scan (much less a *FULL*
> > *TABLE* *SCAN* ;-) ).
> >
> > Read Alex's blog post again, it's a good post (IMHO). He is talking
> > about buckets.
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> > From: Michael Segel <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Wednesday, December 19, 2012 5:23 PM
> > Subject: Re: Is it necessary to set MD5 on rowkey?
> >
> > Ok,
> >
> > Let's try this one more time...
> >
> > If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
> > retrieve the row.
> > If you do something like a salt that uses only a preset of N
> > combinations, you will have to do N get()s in order to fetch the row.
> >
> > This is bad. VERY BAD.
> >
> > If you hash the row, you will get a consistent value each time you hash
> > the key. If you use SHA-1, a collision is mathematically possible but
> > highly improbable. So people have recommended appending the key to the
> > hash to form the new key. Here, you might as well truncate the hash to
> > just the most significant byte or two and then append the key. This will
> > give you enough of an even distribution that you can avoid hot spotting.
> >
> > So if I use the hash, I can effectively still get the row of data back
> > with a single get(). Otherwise it's a full table scan.
> >
> > Do you see the difference?
> >
> >
> > On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Mike,
> >>
> >> If, in your business case, the only thing you need when you retrieve
> >> your data is to do a full scan with MR jobs, then you can salt with
> >> whatever you want: hash, random values, etc.
> >>
> >> If you know you have x regions, then you can simply do a round-robin
> >> salting, or a random salting over those x regions.
> >>
> >> Then when you run your MR job, you discard the first bytes, and do
> >> what you want with your data.
> >>
> >> So I also think that salting can still be useful. It all depends on
> >> what you do with your data.
> >>
> >> Just my opinion.
> >>
> >> JM
> >>
> >> 2012/12/19, Michael Segel <[EMAIL PROTECTED]>:
> >>> Ok...
> >>>
> >>> So you use a random byte or two at the front of the row.
> >>> How do you then use get() to find the row?
> >>> How do you do a partial scan()?
> >>>
> >>> Do you start to see the problem?
> >>> The only way to get to the row is to do a full table scan. That kills
> >>> HBase, and you would be better off going with a partitioned Hive table.
> >>>
> >>> Using a hash of the key or a portion of the hash is not a salt.
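
For reference, the read-side cost argued over in this thread for a random or round-robin prefix drawn from N buckets can be sketched in Python as follows. This is only an illustration under assumed names (`salted_rowkey`, `candidate_rowkeys`) and an assumed one-byte prefix; it is not HBase client code.

```python
import random


def salted_rowkey(key: bytes, num_buckets: int) -> bytes:
    """Write path: prefix the key with a randomly chosen bucket byte."""
    return bytes([random.randrange(num_buckets)]) + key


def candidate_rowkeys(key: bytes, num_buckets: int) -> list:
    """Read path: the prefix cannot be recomputed from the key, so every
    bucket has to be tried, giving num_buckets candidate rowkeys."""
    return [bytes([bucket]) + key for bucket in range(num_buckets)]
```

With N buckets, a point lookup may need up to N get()s (one per candidate rowkey) rather than a full table scan, while a prefix derived from a hash of the key needs exactly one.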