HBase, mail # user - Timestamp as a key good practice?

Re: Timestamp as a key good practice?
Michel Segel 2012-06-16, 14:35
You can't salt the key in the second table.
By salting the key, you lose the ability to do range scans, which is what you want to do.
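Why salting defeats the range scan can be sketched in a few lines (Python used purely as illustration; the bucket function here is a simple stand-in for a real hash-based salt):

```python
def salted_key(ts: str, buckets: int = 3) -> str:
    """Prefix the timestamp key with a salt bucket (stand-in hash)."""
    salt = int(ts[-3:]) % buckets
    return f"{salt}-{ts}"

timestamps = ["20120616-001", "20120616-002", "20120616-003", "20120616-004"]

# Unsalted keys sort in timestamp order, so one start/stop-row range
# scan returns a contiguous time window.
print(sorted(timestamps))

# Salted keys sort by salt bucket first: consecutive timestamps are no
# longer adjacent, so a single range scan can't cover a time range.
print(sorted(salted_key(t) for t in timestamps))
```

The trade-off is the usual one: salting spreads writes across regions but scatters the very ordering a time-range scan depends on.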

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Thanks all for your comments and suggestions. Regarding the
> hotspotting, I will try to salt the key in the 2nd table and see the
> results.
> Yesterday I finished installing my 4-server cluster with old machines.
> It's slow, but it's working. So I will do some testing.
> You recommend modifying the timestamp to be to the second or
> minute and having more entries per row. Is that because it's better to
> have more columns than rows? Or is it more because that will allow a
> more "squared" pattern (lots of rows, lots of columns), which is
> more efficient?
> JM
> 2012/6/15, Michael Segel <[EMAIL PROTECTED]>:
>> Thought about this a little bit more...
>> You will want two tables for a solution.
>> Table 1:  Key: Unique ID
>>           Column: FilePath            Value: Full path to the file
>>           Column: Last Update Time    Value: timestamp
>> Table 2:  Key: Last Update Time (the timestamp)
>>           Columns 1-N: Unique ID      Value: Full path to the file
>> Now if you want to get fancy, in Table 1 you could use the timestamp on
>> the FilePath column to hold the last update time.
>> But it's probably easier for you to start by keeping the data as a separate
>> column and ignore the timestamps on the columns for now.
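To make the two-table layout concrete, here is a minimal in-memory sketch (plain Python dicts standing in for HBase tables; all IDs, paths, and column names are illustrative, not HBase API):

```python
# Table 1: unique file ID -> its columns (path and last update time).
table1 = {
    "file-0001": {"FilePath": "/data/a.txt", "LastUpdateTime": "20120616143500"},
    "file-0002": {"FilePath": "/data/b.txt", "LastUpdateTime": "20120616143500"},
}

# Table 2: last update time -> one column per file ID updated at that time.
# Two files sharing a timestamp become two columns in the same row.
table2 = {
    "20120616143500": {"file-0001": "/data/a.txt", "file-0002": "/data/b.txt"},
}

print(table2["20120616143500"])
```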
>> Note the following:
>> 1) I used the notation Columns 1-N to reflect that for a given timestamp you
>> may or may not have multiple files that were updated. (You weren't specific
>> as to the scale.)
>> This is a good example of HBase's column-oriented approach, where you may or
>> may not have a column. It doesn't matter. :-) You could also modify the
>> timestamp to be to the second or minute and have more entries per row. It
>> doesn't matter. You insert based on timestamp:columnName, value, so you will
>> just add a column to this table.
>> 2) First, prove that the logic works. You insert/update table 1 to capture
>> the ID of the file and its last update time. You then delete the old
>> timestamp entry in table 2, then insert the new entry in table 2.
>> 3) You store Table 2 in ascending order. Then when you want to find your
>> last 500 entries, you do a start scan at 0x000 and then limit the scan to
>> 500 rows. Note that you may or may not have multiple entries so as you walk
>> through the result set, you count the number of columns and stop when you
>> have 500 columns, regardless of the number of rows you've processed.
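The counting logic in step 3 can be sketched the same way (dicts standing in for table 2, iterated in ascending key order as an HBase scan would; the 500-entry limit is shrunk to 3 for illustration):

```python
def latest_n(table2, n):
    """Walk table-2 rows in ascending key order, counting columns, and
    stop once n file entries have been collected (regardless of rows)."""
    results = []
    for ts in sorted(table2):            # HBase returns rows key-sorted
        for file_id, path in table2[ts].items():
            results.append((ts, file_id, path))
            if len(results) == n:
                return results
    return results

table2 = {
    "t1": {"f1": "/a", "f2": "/b"},
    "t2": {"f3": "/c"},
    "t3": {"f4": "/d"},
}
print(latest_n(table2, 3))  # stops partway through the scan at 3 columns
```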
>> This should solve your problem and be pretty efficient.
>> You can then work out the coprocessors and add them to the solution to be
>> even more efficient.
>> With respect to 'hot-spotting', it can't be helped entirely. You could hash
>> your unique ID in table 1; this will reduce the potential for a hotspot as
>> the table splits.
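Hashing the table-1 key is commonly done with a short hash prefix, sketched here with Python's hashlib (the key format is illustrative):

```python
import hashlib

def hashed_row_key(unique_id: str) -> str:
    """Prefix the ID with a short hash so sequential IDs are spread
    across regions instead of piling onto one 'hot' region."""
    prefix = hashlib.md5(unique_id.encode()).hexdigest()[:4]
    return f"{prefix}-{unique_id}"

for i in range(3):
    print(hashed_row_key(f"file-{i:04d}"))
```

Because the prefix is recomputable from the ID, point lookups by ID still work; only range scans over raw IDs are lost, which doesn't matter for table 1.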
>> On table 2, because you have temporal data and you want to efficiently scan
>> a small portion of the table based on size, you will always scan the first
>> block. However, as data rolls off and compaction occurs, you will probably
>> have to do some cleanup. I'm not sure how HBase handles regions that no
>> longer contain data. When you compact an empty region, does it go away?
>> By switching to coprocessors, you limit the update accesses to the
>> second table, so you should still have pretty good performance.
>> You may also want to look at Asynchronous HBase, however I don't know how
>> well it will work with Coprocessors or if you want to perform async
>> operations in this specific use case.
>> Good luck, HTH...
>> -Mike
>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>> Hi Michael,
>>> For now this is more a proof of concept than a production application.