Re: Is it necessary to set MD5 on rowkey?
I think there's some hair-splitting going on here. The term "salting," by
strict definition [0] from the cryptographic context, means the
introduction of randomness to produce a one-way encoding of a value. The
technique for rowkey design described here does not include the
introduction of said randomness, nor is it a one-way operation. Instead, it
divides the input more or less evenly across a fixed number of buckets.

If you've introduced randomness via a one-way operation, you have no way to
reproduce the rowkey after it's been inserted. Thus, accessing a specific row
would be O(n), where n is the number of rows in the table -- you'd have to
scan the table looking for the desired row. Use of a bucketing technique, on
the other hand, adds a small amount of computation to calculate the true
rowkey, making access to a single row O(1) + C with respect to the number of
rows in the table.
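
To make that concrete, here's a rough sketch of the bucketing approach using
the standard HBase Java client. The bucket count, class name, and helper
names below are illustrative assumptions, not anything prescribed:

  import java.io.IOException;
  import java.util.Arrays;

  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BucketedKeys {

    // Fixed number of buckets, chosen up front (illustrative value).
    private static final int NUM_BUCKETS = 16;

    // Derive a one-byte bucket prefix deterministically from the original
    // key and prepend it. No randomness, and nothing one-way, is involved.
    static byte[] bucketedKey(byte[] originalKey) {
      int bucket = (Arrays.hashCode(originalKey) & 0x7fffffff) % NUM_BUCKETS;
      return Bytes.add(new byte[] { (byte) bucket }, originalKey);
    }

    // Because the prefix is recomputed from the key itself, a single Get
    // suffices: O(1) + C, independent of the number of rows in the table.
    static Result fetch(Table table, byte[] originalKey) throws IOException {
      return table.get(new Get(bucketedKey(originalKey)));
    }
  }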

-n

[0]: http://en.wikipedia.org/wiki/Salt_(cryptography)

On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Lars,
>
> Ok... he's talking about buckets.
>
> So when you have N buckets, what is the least number of get()s you need
> to fetch a single row?
> (Hint: The answer is N)
>
> How many scans? (N again)
>
> Do you disagree?
>
>
> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Mike, please think about what you write before you write it.
> > You will most definitely not need a full table scan (much less a *FULL*
> *TABLE* *SCAN* ;-) ).
> >
> > Read Alex's blog post again, it's a good post (IMHO). He is talking
> about buckets.
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> > From: Michael Segel <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Wednesday, December 19, 2012 5:23 PM
> > Subject: Re: Is it necessary to set MD5 on rowkey?
> >
> > Ok,
> >
> > Let's try this one more time...
> >
> > If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
> > retrieve the row.
> > If you do something like a salt that draws from a fixed set of N
> > values, you will have to do N get()s in order to fetch the row.
> >
> > This is bad. VERY BAD.
> >
> > If you hash the row, you will get a consistent value each time you hash
> > the key. If you use SHA-1, a collision is mathematically possible, but
> > highly improbable. So people have recommended appending the key to the
> > hash to form the new key. Here, you might as well truncate the hash to
> > just the most significant byte or two and then append the key. This
> > will give you enough of an even distribution that you can avoid hot
> > spotting.
> >
> > So if I use the hash, I can effectively still get the row of data back
> > with a single get(). Otherwise it's a full table scan.
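> >
> > Something along these lines (just a sketch; the two-byte prefix length
> > and the class name are arbitrary choices for illustration):
> >
> >   import java.security.MessageDigest;
> >   import java.security.NoSuchAlgorithmException;
> >
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   public class HashPrefixedKeys {
> >
> >     // Prepend the two most significant bytes of the key's SHA-1 hash.
> >     // The prefix is recomputable from the key, so a single get() can
> >     // still locate the row.
> >     static byte[] hashPrefixedKey(byte[] key)
> >         throws NoSuchAlgorithmException {
> >       byte[] digest = MessageDigest.getInstance("SHA-1").digest(key);
> >       return Bytes.add(new byte[] { digest[0], digest[1] }, key);
> >     }
> >   }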
> >
> > Do you see the difference?
> >
> >
> > On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Mike,
> >>
> >> If, in your business case, the only thing you need when you retrieve
> >> your data is to do full scans through MR jobs, then you can salt with
> >> whatever you want. Hash, random values, etc.
> >>
> >> If you know you have x regions, then you can simply do a round-robin
> >> salting, or a random salting over those x regions.
> >>
> >> Then when you run your MR job, you discard the first bytes, and do
> >> what you want with your data.
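> >>
> >> For example (a minimal sketch assuming a one-byte salt; the mapper
> >> class and its output types are just placeholders):
> >>
> >>   import java.io.IOException;
> >>   import java.util.Arrays;
> >>
> >>   import org.apache.hadoop.hbase.client.Result;
> >>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >>   import org.apache.hadoop.hbase.mapreduce.TableMapper;
> >>   import org.apache.hadoop.io.NullWritable;
> >>   import org.apache.hadoop.io.Text;
> >>
> >>   // Drops the one-byte salt prefix and emits the original key.
> >>   public class DesaltingMapper extends TableMapper<Text, NullWritable> {
> >>     @Override
> >>     protected void map(ImmutableBytesWritable row, Result value,
> >>         Context context) throws IOException, InterruptedException {
> >>       byte[] salted = row.copyBytes();
> >>       byte[] original = Arrays.copyOfRange(salted, 1, salted.length);
> >>       context.write(new Text(original), NullWritable.get());
> >>     }
> >>   }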
> >>
> >> So I also think that salting can still be useful. It all depends on
> >> what you do with your data.
> >>
> >> Just my opinion.
> >>
> >> JM
> >>
> >> 2012/12/19, Michael Segel <[EMAIL PROTECTED]>:
> >>> Ok...
> >>>
> >>> So you use a random byte or two at the front of the row.
> >>> How do you then use get() to find the row?
> >>> How do you do a partial scan()?
> >>>
> >>> Do you start to see the problem?
> >>> The only way to get to the row is to do a full table scan. That kills
> >>> HBase, and you would be better off going with a partitioned Hive table.
> >>>
> >>> Using a hash of the key or a portion of the hash is not a salt.