HBase, mail # user - Timestamp as a key good practice?


Jean-Marc Spaggiari 2012-06-13, 16:16
Otis Gospodnetic 2012-06-14, 06:06
Jean-Marc Spaggiari 2012-06-14, 10:39
Michael Segel 2012-06-14, 11:55
Jean-Marc Spaggiari 2012-06-14, 12:22
Michael Segel 2012-06-14, 18:14
Jean-Marc Spaggiari 2012-06-14, 18:47
Michael Segel 2012-06-14, 19:46
Michael Segel 2012-06-15, 14:21
Jean-Marc Spaggiari 2012-06-16, 11:22
Re: Timestamp as a key good practice?
Michel Segel 2012-06-16, 14:35
You can't salt the key in the second table.
By salting the key, you lose the ability to do range scans, which is what you want to do.
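The point above can be sketched in plain Python (a toy illustration, not the HBase API): HBase sorts rows lexicographically by key, so unsalted timestamp keys keep a time range contiguous, while a salt prefix scatters the same rows across the key space.

```python
# Toy illustration of why salting the second table's key breaks range scans.
timestamps = ["20120614T1000", "20120614T1030", "20120615T0900"]

# Unsalted: keys sort in time order, so one scan(start, stop) covers a range.
assert sorted(timestamps) == timestamps

# Salted with a toy bucket prefix (any deterministic salt behaves this way):
def salt(ts, buckets=4):
    return f"{sum(map(ord, ts)) % buckets}|{ts}"

salted = sorted(salt(ts) for ts in timestamps)
# The rows no longer sort by time; a time-range query now needs one scan
# per salt bucket instead of a single contiguous scan.
assert [k.split("|")[1] for k in salted] != timestamps
```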

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Thanks all for your comments and suggestions. Regarding the
> hotspotting I will try to salt the key in the 2nd table and see the
> results.
>
> Yesterday I finished installing my 4-server cluster with old machines.
> It's slow, but it's working, so I will do some testing.
>
> You are recommending modifying the timestamp to be to the second or
> minute and having more entries per row. Is that because it's better to
> have more columns than rows? Or is it more because that will allow a
> more "square" pattern (lots of rows, lots of columns), which is
> more efficient?
>
> JM
>
> 2012/6/15, Michael Segel <[EMAIL PROTECTED]>:
>> Thought about this a little bit more...
>>
>> You will want two tables for a solution.
>>
>> Table 1   Key: Unique ID
>>           Column: FilePath          Value: full path to the file
>>           Column: LastUpdateTime    Value: timestamp
>>
>> Table 2   Key: Last update time (the timestamp)
>>           Column 1-N: Unique ID     Value: full path to the file
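The two-table layout above can be modeled with a small in-memory sketch (illustrative names; this mimics the row/column shape, not the HBase client API):

```python
# Toy model of the two-table design: table1 answers "what is file X's last
# update time?"; table2 is keyed by timestamp so oldest entries scan first.
table1 = {}  # row key: unique ID -> {"FilePath": ..., "LastUpdateTime": ...}
table2 = {}  # row key: timestamp -> {unique ID: full path}

def record(uid, path, ts):
    table1[uid] = {"FilePath": path, "LastUpdateTime": ts}
    table2.setdefault(ts, {})[uid] = path  # same ts: extra column, not a new row

record("id-1", "/data/a.txt", "20120614T1000")
record("id-2", "/data/b.txt", "20120614T1000")

assert table1["id-2"]["LastUpdateTime"] == "20120614T1000"
assert len(table2) == 1 and len(table2["20120614T1000"]) == 2
```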
>>
>> Now if you want to get fancy, in Table 1 you could use the cell timestamp on
>> the FilePath column to hold the last update time.
>> But it's probably easier to start by keeping the data as a separate
>> column and ignore the per-cell timestamps for now.
>>
>> Note the following:
>>
>> 1) I used the notation Column 1-N to reflect that for a given timestamp you
>> may or may not have multiple files that were updated. (You weren't specific
>> as to the scale.)
>> This is a good example of HBase's column-oriented approach, where you may or
>> may not have a column. It doesn't matter. :-) You could also modify the
>> timestamp to be to the second or minute and have more entries per row. It
>> doesn't matter. You insert based on the timestamp row key, a column name, and
>> a value, so you will add a column to this table.
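The coarser-granularity idea can be sketched like this (illustrative, in-memory): truncating the update time to the minute makes every file touched in that minute a column in one row, giving wider rows and fewer rows overall.

```python
# Truncate the timestamp to the minute so updates in the same minute
# share a row key and land as separate columns in that row.
def minute_key(ts):             # "20120614T100503" -> "20120614T1005"
    return ts[:13]

rows = {}
for uid, ts in [("id-1", "20120614T100503"), ("id-2", "20120614T100547")]:
    rows.setdefault(minute_key(ts), {})[uid] = ts

assert list(rows) == ["20120614T1005"]      # one row...
assert len(rows["20120614T1005"]) == 2      # ...two columns
```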
>>
>> 2) First prove that the logic works. You insert/update table 1 to capture
>> the ID of the file and its last update time. You then delete the old
>> timestamp entry in table 2, then insert the new entry in table 2.
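That update sequence can be sketched as (illustrative, in-memory): read the file's old timestamp from table 1, delete its stale entry in table 2, then insert the new one.

```python
# Sketch of step 2: delete-then-insert so table 2 never holds stale entries.
table1 = {"id-1": {"FilePath": "/data/a.txt", "LastUpdateTime": "t1"}}
table2 = {"t1": {"id-1": "/data/a.txt"}}

def move(uid, path, new_ts):
    old = table1.get(uid)
    if old:                                        # delete old table 2 entry
        cols = table2.get(old["LastUpdateTime"], {})
        cols.pop(uid, None)
        if not cols:                               # drop the row if now empty
            table2.pop(old["LastUpdateTime"], None)
    table1[uid] = {"FilePath": path, "LastUpdateTime": new_ts}
    table2.setdefault(new_ts, {})[uid] = path      # insert new table 2 entry

move("id-1", "/data/a.txt", "t2")
assert "t1" not in table2
assert table2["t2"] == {"id-1": "/data/a.txt"}
```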
>>
>> 3) You store Table 2 in ascending order. Then when you want to find your
>> last 500 entries, you start a scan at 0x000 and limit the scan to
>> 500 rows. Note that a row may hold multiple entries, so as you walk
>> through the result set, count the number of columns and stop when you
>> have 500 columns, regardless of the number of rows you've processed.
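That bounded scan can be sketched as (illustrative): walk rows in ascending key order and count columns, not rows, stopping at the limit.

```python
# Walk rows in ascending row-key order; the limit applies to columns seen.
def oldest(table2, limit=500):
    out = []
    for ts in sorted(table2):              # ascending row-key order
        for uid, path in table2[ts].items():
            out.append((ts, uid, path))
            if len(out) == limit:          # stop on the Nth column, mid-row if needed
                return out
    return out

t2 = {"t1": {"a": "/a", "b": "/b"}, "t2": {"c": "/c"}}
assert oldest(t2, limit=2) == [("t1", "a", "/a"), ("t1", "b", "/b")]
```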
>>
>> This should solve your problem and be pretty efficient.
>> You can then work out the Coprocessors and add it to the solution to be even
>> more efficient.
>>
>>
>> With respect to 'hot-spotting', it can't be helped entirely. You could hash
>> your unique ID in table 1; this will reduce the potential of a hotspot as
>> the table splits.
>> On table 2, because you have temporal data and you want to efficiently scan
>> a small portion of the table based on size, you will always scan the first
>> block. However, as data rolls off and compactions occur, you will probably
>> have to do some cleanup. I'm not sure how HBase handles regions that no
>> longer contain data. When you compact an empty region, does it go away?
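The hash-prefix idea for table 1 can be sketched as (illustrative; `hashed_key` and the bucket count are hypothetical names, not an HBase API): a hash prefix spreads writes across region splits, and it is safe here because table 1 is only ever read by exact key (a get), never by range scan, so losing key order costs nothing.

```python
import hashlib

# Prefix the unique ID with a hash-derived bucket to spread writes.
def hashed_key(uid, buckets=16):
    bucket = int(hashlib.md5(uid.encode()).hexdigest(), 16) % buckets
    return f"{bucket:02d}|{uid}"

# Reads recompute the same prefix, so point lookups still find the row.
assert hashed_key("id-1") == hashed_key("id-1")
assert hashed_key("id-1").endswith("|id-1")
```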
>>
>> By switching to coprocessors, you limit the update access to the
>> second table, so you should still have pretty good performance.
>>
>> You may also want to look at Asynchronous HBase, however I don't know how
>> well it will work with Coprocessors or if you want to perform async
>> operations in this specific use case.
>>
>> Good luck, HTH...
>>
>> -Mike
>>
>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>
>>> Hi Michael,
>>>
>>> For now this is more a proof of concept than a production application.
Jean-Marc Spaggiari 2012-06-16, 14:42
Michael Segel 2012-06-16, 16:33
Rob Verkuylen 2012-06-16, 19:10
Jean-Marc Spaggiari 2012-06-21, 11:43
Michael Segel 2012-06-21, 14:20
Jean-Marc Spaggiari 2012-06-22, 19:43
Jean-Marc Spaggiari 2012-06-23, 02:20
Jean-Daniel Cryans 2012-06-26, 17:50
Jean-Marc Spaggiari 2012-06-26, 17:56
Jean-Daniel Cryans 2012-06-26, 18:12
Michael Segel 2012-06-26, 19:01
Jean-Marc Spaggiari 2012-06-26, 19:04
Doug Meil 2012-06-14, 21:18