HBase >> mail # user >> Timestamp as a key good practice?

Re: Timestamp as a key good practice?

You indicated that you didn't want to do full table scans when you want to find out which files hadn't been touched since time X has passed.
(X could be months, weeks, days, hours, etc.)

So here's the thing.
First, I am not convinced that you will have hot spotting.
Second, you now end up having to do 26 scans instead of one, and then you need to join the result sets.

Not really a good solution if you think about it.
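The 26-scan fan-out can be made concrete with a small sketch. This is illustrative code, not from the thread: the salt function and key layout are hypothetical, and the point is only that a one-letter prefix turns one contiguous time range into 26 separate ranges whose results must be merged client-side.

```java
import java.util.ArrayList;
import java.util.List;

public class SaltedScanRanges {
    // Hypothetical salt: a prefix letter derived from a hash of the file id.
    // This spreads writes across regions but destroys the global time order.
    static String saltedKey(String fileId, String paddedTimestamp) {
        char salt = (char) ('a' + Math.abs(fileId.hashCode() % 26));
        return salt + paddedTimestamp;
    }

    // To cover the time range [startTs, stopTs) over salted keys,
    // you must build one (start, stop) pair per salt value: 26 scans.
    static List<String[]> scanRanges(String startTs, String stopTs) {
        List<String[]> ranges = new ArrayList<>();
        for (char s = 'a'; s <= 'z'; s++) {
            ranges.add(new String[]{s + startTs, s + stopTs});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 26 separate scans whose results must then be merged by the client.
        System.out.println(scanRanges("0000000000000", "1339850000000").size()); // 26
    }
}
```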

Oh and I don't believe that you will be hitting a single region, although you may hit a region hard.
(Your second table's key is the timestamp of the last update to the file. If a file hasn't been touched in a week, then at scale it probably won't be in the same region as a file that was recently touched.)
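By contrast, an unsalted timestamp key keeps the "not touched since X" query a single range scan. A minimal sketch, with illustrative names and values not taken from the thread: zero-padded epoch-millisecond row keys sort lexicographically in time order, so every stale file sits in one contiguous key range ending at the cutoff.

```java
public class TimestampKeys {
    // Left-pad the millisecond timestamp to a fixed width so that
    // lexicographic (byte-wise) comparison matches numeric comparison.
    static String rowKey(long epochMillis) {
        return String.format("%013d", epochMillis);
    }

    public static void main(String[] args) {
        long older  = 1339800000000L;  // hypothetical stale file
        long newer  = 1339900000000L;  // hypothetical recently touched file
        long cutoff = 1339850000000L;  // "not touched since" threshold

        // Because padded keys sort in time order, a single scan with
        // stopRow = rowKey(cutoff) returns exactly the stale files.
        System.out.println(rowKey(older).compareTo(rowKey(cutoff)) < 0);  // true
        System.out.println(rowKey(newer).compareTo(rowKey(cutoff)) < 0);  // false
    }
}
```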

I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be applied to a subset of problems.
(Think round-robin partitioning in an RDBMS. DB2 was big on this.)



On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:

> Let's imagine the timestamp is "123456789".
> If I salt it with letters from 'a' to 'z' then it will always be split
> between a few RegionServers. I will have keys like "t123456789". The issue is
> that I will have to do 26 queries to be able to find all the entries.
> I will need to query from A000000000 to Axxxxxxxxx, then same for B,
> and so on.
> So which is worse? Am I better off dealing with the hotspotting?
> Salting the key myself? Or what if I use something like HBaseWD?
> JM
> 2012/6/16, Michel Segel <[EMAIL PROTECTED]>:
>> You can't salt the key in the second table.
>> By salting the key, you lose the ability to do range scans, which is what
>> you want to do.
>> Sent from a remote device. Please excuse any typos...
>> Mike Segel
>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>> wrote:
>>> Thanks all for your comments and suggestions. Regarding the
>>> hotspotting I will try to salt the key in the 2nd table and see the
>>> results.
>>> Yesterday I finished installing my 4-server cluster with old machines.
>>> It's slow, but it's working. So I will do some testing.
>>> You are recommending modifying the timestamp to be to the second or
>>> minute and having more entries per row. Is that because it's better to
>>> have more columns than rows? Or is it more because that will allow a
>>> more "square" pattern (lots of rows, lots of columns), which is
>>> more efficient?
>>> JM
>>> 2012/6/15, Michael Segel <[EMAIL PROTECTED]>:
>>>> Thought about this a little bit more...
>>>> You will want two tables for a solution.
>>>> Table 1:  Key: Unique ID
>>>>           Column: FilePath          Value: full path to the file
>>>>           Column: Last Update Time  Value: timestamp
>>>> Table 2:  Key: Last Update Time (the timestamp)
>>>>           Column 1-N: Unique ID     Value: full path to the file
>>>> Now if you want to get fancy, in Table 1 you could use the timestamp
>>>> on the column FilePath to hold the last update time. But it's
>>>> probably easier for you to start by keeping the data as a separate
>>>> column and ignore the timestamps on the columns for now.
>>>> Note the following:
>>>> 1) I used the notation Column 1-N to reflect that for a given
>>>> timestamp you may or may not have multiple files that were updated.
>>>> (You weren't specific as to the scale.)
>>>> This is a good example of HBase's column-oriented approach, where you
>>>> may or may not have a column. It doesn't matter. :-) You could also
>>>> modify the timestamp to be to the second or minute and have more
>>>> entries per row. It doesn't matter. You insert based on
>>>> timestamp:columnName, value, so you will add a column to this table.
>>>> 2) First prove that the logic works. You insert/update table 1 to
>>>> capture the ID of the file and its last update time. You then delete
>>>> the old