HBase >> mail # user >> Timestamp as a key good practice?

Re: Timestamp as a key good practice?
You can't salt the key in the second table.
By salting the key, you lose the ability to do range scans, which is what you want to do.
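A minimal sketch of Mike's point, using plain strings as stand-ins for HBase row keys (the 4-bucket salt and the key format are illustrative assumptions, not from the thread). HBase stores rows sorted lexicographically by key, so consecutive timestamps sit next to each other and one range scan covers them; a salt prefix scatters them across buckets:

```python
# HBase keeps rows sorted by key bytes, so unsalted timestamp keys are
# contiguous and a single start/stop-row Scan retrieves them.
# A salt prefix (here: ts % 4, a hypothetical choice) spreads the same
# four updates across four buckets, so one range scan misses three of them.

timestamps = [1339800000, 1339800001, 1339800002, 1339800003]

plain_keys = sorted("%010d" % ts for ts in timestamps)
salted_keys = sorted("%d-%010d" % (ts % 4, ts) for ts in timestamps)

print(plain_keys)   # four adjacent keys: one scan range covers them all
print(salted_keys)  # four different salt buckets: no single contiguous range
```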

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Thanks all for your comments and suggestions. Regarding the
> hotspotting I will try to salt the key in the 2nd table and see the
> results.
> Yesterday I finished installing my 4-server cluster with old machines.
> It's slow, but it's working, so I will do some testing.
> You are recommending modifying the timestamp to be to the second or
> minute and having more entries per row. Is that because it's better to
> have more columns than rows? Or is it more because that will allow a
> more "squared" pattern (lots of rows, lots of columns), which is more
> efficient?
> JM
> 2012/6/15, Michael Segel <[EMAIL PROTECTED]>:
>> Thought about this a little bit more...
>> You will want two tables for a solution.
>> Table 1:  Key: Unique ID
>>           Column: FilePath            Value: full path to file
>>           Column: Last Update Time    Value: timestamp
>> Table 2:  Key: Last Update Time (the timestamp)
>>           Column 1-N: Unique ID       Value: full path to the file
>> Now if you want to get fancy, in Table 1 you could use the timestamp on
>> the FilePath column to hold the last update time.
>> But it's probably easier for you to start by keeping the data as a separate
>> column and ignoring the timestamps on the columns for now.
>> Note the following:
>> 1) I used the notation Column 1-N to reflect that for a given timestamp you
>> may or may not have multiple files that were updated. (You weren't specific
>> as to the scale.)
>> This is a good example of HBase's column-oriented approach, where you may or
>> may not have a column. It doesn't matter. :-) You could also modify the
>> timestamp to be to the second or minute and have more entries per row. It
>> doesn't matter. You insert based on timestamp:columnName, value, so you
>> just add a column to this table.
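To make point 1 concrete, here is a small sketch of the coarsened-timestamp idea, with a dict standing in for Table 2 rows (the minute granularity and the sample IDs/paths are assumptions for illustration). Rounding the update time down to the minute makes files updated within the same minute become columns of a single row:

```python
# Dict stands in for Table 2: row key -> {column qualifier: value}.
rows = {}

def add_file(ts, file_id, path):
    minute = ts - (ts % 60)                        # row key: timestamp floored to the minute
    rows.setdefault(minute, {})[file_id] = path    # one column per file, qualifier = unique ID

add_file(1339800005, "id-001", "/data/a.txt")
add_file(1339800042, "id-002", "/data/b.txt")   # same minute -> same row, extra column
add_file(1339800075, "id-003", "/data/c.txt")   # next minute -> new row
```

After these three inserts there are only two rows: the first holds two columns (id-001 and id-002), the second holds one.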
>> 2) First prove that the logic works. You insert/update table 1 to capture
>> the ID of the file and its last update time. You then delete the old
>> timestamp entry in table 2, then insert the new entry in table 2.
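The two-table layout and the update flow in point 2 can be sketched with plain dicts standing in for the HBase tables (the column names "cf:path" / "cf:updated" and the sample values are assumptions, not from the thread):

```python
table1 = {}   # key: unique ID  -> {"cf:path": path, "cf:updated": ts}
table2 = {}   # key: timestamp  -> {unique ID: path}, one column per file

def record_update(file_id, path, ts):
    # Table 1: capture the file's ID and its last update time.
    old_ts = table1.get(file_id, {}).get("cf:updated")
    table1[file_id] = {"cf:path": path, "cf:updated": ts}
    # Table 2: delete the old timestamp entry, then insert the new one.
    if old_ts is not None:
        table2.get(old_ts, {}).pop(file_id, None)
    table2.setdefault(ts, {})[file_id] = path

record_update("id-001", "/data/a.txt", 1339800000)
record_update("id-001", "/data/a.txt", 1339800500)   # same file, touched again
```

After the second update, table 2 has no entry for id-001 under the old timestamp, only under the new one.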
>> 3) You store Table 2 in ascending order. Then when you want to find your
>> last 500 entries, you do a start scan at 0x000 and then limit the scan to
>> 500 rows. Note that a row may or may not contain multiple entries, so as
>> you walk through the result set, count the number of columns and stop when
>> you reach 500 columns, regardless of the number of rows you've processed.
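The scan loop in point 3 can be sketched like this, with a dict of rows standing in for an HBase Scan's result set (the function name and sample data are illustrative assumptions). The key detail is that the stop condition counts columns (files), not rows:

```python
def first_n_files(scan_result, limit=500):
    """Walk rows in ascending key order, collecting (ts, id, path) tuples."""
    found = []
    for ts in sorted(scan_result):                # ascending, from the start row
        for file_id, path in sorted(scan_result[ts].items()):
            found.append((ts, file_id, path))
            if len(found) == limit:               # stop on column count,
                return found                      # not row count
    return found

example = {
    100: {"id-b": "/b", "id-a": "/a"},            # two columns in one row
    200: {"id-c": "/c"},
}
print(first_n_files(example, limit=2))            # stops mid-row, after 2 files
```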
>> This should solve your problem and be pretty efficient.
>> You can then work out the coprocessors and add them to the solution to be
>> even more efficient.
>> With respect to 'hot-spotting', it can't be helped entirely. You could hash
>> your unique ID in table 1; this will reduce the potential for a hotspot as
>> the table splits.
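One common way to realize the "hash your unique ID" suggestion is a short hash-derived prefix on the table-1 row key; the 16-bucket choice and key format below are assumptions for illustration:

```python
import hashlib

def table1_key(file_id, buckets=16):
    """Prefix the row key with hash(id) % buckets so sequential IDs
    spread across regions as the table splits. Deterministic, so the
    same file always maps to the same key."""
    prefix = int(hashlib.md5(file_id.encode()).hexdigest(), 16) % buckets
    return "%02d-%s" % (prefix, file_id)
```

Because the prefix is derived from the ID itself, point lookups still work (recompute the key from the ID); only range scans over IDs are lost, which is fine for table 1.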
>> On table 2, because you have temporal data and you want to efficiently scan
>> a small portion of the table based on size, you will always scan the first
>> block. However, as data rolls off and compactions occur, you will probably
>> have to do some cleanup. I'm not sure how HBase handles regions that no
>> longer contain data. When an empty region is compacted, does it go away?
>> By switching to coprocessors, you limit the update accesses to the second
>> table, so you should still have pretty good performance.
>> You may also want to look at Asynchronous HBase, however I don't know how
>> well it will work with Coprocessors or if you want to perform async
>> operations in this specific use case.
>> Good luck, HTH...
>> -Mike
>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>> Hi Michael,
>>> For now this is more a proof of concept than a production application.