HBase, mail # user - Timestamp as a key good practice?


Re: Timestamp as a key good practice?
Michael Segel 2012-06-15, 14:21
Thought about this a little bit more...

You will want two tables for a solution.

Table 1:  Key: Unique ID
          Column: FilePath          Value: full path to the file
          Column: Last Update Time  Value: timestamp

Table 2:  Key: Last Update Time  (the timestamp)
          Column 1-N: Unique ID     Value: full path to the file

Now if you want to get fancy, in Table 1 you could use the cell timestamp on the FilePath column to hold the last update time.
But it's probably easier for you to start by keeping the last update time as a separate column and ignoring the built-in cell timestamps for now.
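
As a rough sketch of the two-table layout (the table names "files" and "files_by_time", the family "f", and the qualifiers "path" and "lastUpdate" are just placeholders I picked; this uses the 0.9x-era Java admin API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateFileTables {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Table 1: row key = unique file ID; family "f" holds the "path"
    // and "lastUpdate" columns described above.
    HTableDescriptor files = new HTableDescriptor("files");
    files.addFamily(new HColumnDescriptor("f"));
    admin.createTable(files);

    // Table 2: row key = last-update timestamp; one column per file ID.
    HTableDescriptor byTime = new HTableDescriptor("files_by_time");
    byTime.addFamily(new HColumnDescriptor("f"));
    admin.createTable(byTime);

    admin.close();
  }
}

A one-letter family name keeps every stored KeyValue a bit smaller, but that's a detail you can tune later.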

Note the following:

1) I used the notation Column 1-N to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale.)
This is a good example of HBase's column-oriented approach, where a column may or may not exist for a given row; it doesn't matter. :-) You could also round the timestamp to the second or the minute and have more entries per row. It doesn't matter: you insert based on timestamp:columnName, value, so you simply add a column to that row (sketched below, after point 3).

2) First prove that the logic works. You insert/update table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in table 2 and insert the new entry in table 2 (see the sketch below).

3) You store Table 2 in ascending order. Then when you want to find your last 500 entries, you do a start scan at 0x000 and limit the scan to 500 rows. Note that a row may contain multiple entries, so as you walk through the result set you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed.
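
To make points 1 and 2 concrete, here is a minimal sketch of one update cycle, using the same placeholder names as above (it assumes conf, uniqueId, fullPath, and newTs, an epoch-millis long, are already in hand, plus the usual imports from org.apache.hadoop.hbase.client and org.apache.hadoop.hbase.util.Bytes). Since Bytes.toBytes(long) produces a fixed-width big-endian key, epoch timestamps sort ascending on their own; rounding newTs to the second or minute before writing is what would group several files under one row.

HTable files  = new HTable(conf, "files");
HTable byTime = new HTable(conf, "files_by_time");

// 1. Read the file's previous update time (if any) from table 1.
Get g = new Get(Bytes.toBytes(uniqueId));
g.addColumn(Bytes.toBytes("f"), Bytes.toBytes("lastUpdate"));
Result r = files.get(g);
byte[] oldTs = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("lastUpdate"));

// 2. Delete the stale column under the old timestamp row in table 2.
if (oldTs != null) {
  Delete d = new Delete(oldTs);
  d.deleteColumns(Bytes.toBytes("f"), Bytes.toBytes(uniqueId));
  byTime.delete(d);
}

// 3. Write the new state: update table 1, then add a column (qualifier =
//    unique ID) to the new timestamp row in table 2.
Put p1 = new Put(Bytes.toBytes(uniqueId));
p1.add(Bytes.toBytes("f"), Bytes.toBytes("path"), Bytes.toBytes(fullPath));
p1.add(Bytes.toBytes("f"), Bytes.toBytes("lastUpdate"), Bytes.toBytes(newTs));
files.put(p1);

Put p2 = new Put(Bytes.toBytes(newTs));
p2.add(Bytes.toBytes("f"), Bytes.toBytes(uniqueId), Bytes.toBytes(fullPath));
byTime.put(p2);

And for point 3, a sketch of the 500-entry scan that counts columns rather than rows:

// Scan table 2 from the lowest row key and stop after 500 file entries.
Scan scan = new Scan();            // no start row = begin at the first region
scan.setCaching(100);
ResultScanner scanner = byTime.getScanner(scan);
int seen = 0;
outer:
for (Result row : scanner) {
  for (KeyValue kv : row.raw()) {  // one KeyValue per file column
    // kv.getQualifier() = unique ID, kv.getValue() = full path to the file
    if (++seen == 500) break outer;
  }
}
scanner.close();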

This should solve your problem and be pretty efficient.
You can then work out the coprocessors and add them to the solution to be even more efficient.
With respect to 'hot-spotting', it can't be helped entirely. You could hash your unique ID in table 1 (see the sketch below); this will reduce the potential for a hotspot as the table splits.
On table 2, because you have temporal data and you want to efficiently scan a small portion of the table, you will always be scanning the first region(s). As data rolls off and compactions occur, you will probably have to do some cleanup. I'm not sure how HBase handles regions that no longer contain data: when an empty region is compacted, does it go away?
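
If you do go the hashing route on table 1, a minimal sketch of the idea (MD5 is just one choice of stable hash, and the 4-byte prefix length is arbitrary):

// Spread table-1 writes across regions by prefixing the row key with a few
// bytes of a hash of the unique ID; the full ID follows so the key stays unique.
byte[] hash = java.security.MessageDigest.getInstance("MD5")
                  .digest(Bytes.toBytes(uniqueId));
byte[] rowKey = Bytes.add(Bytes.head(hash, 4), Bytes.toBytes(uniqueId));

The trade-off is that you can no longer scan table 1 in ID order, which doesn't matter here since table 1 is only hit with point lookups.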

By switching to coprocessors, you limit what updates the second table (the coprocessor does that work server-side), so you should still have pretty good performance.
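
For what it's worth, a very rough sketch of that idea as a RegionObserver attached to table 1, written against the 0.92/0.94-era coprocessor API (the postPut signature changes in later releases); the names are the same placeholders as above, and deleting the file's previous entry (step 2 above) is left out for brevity:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class FilesByTimeObserver extends BaseRegionObserver {
  private static final byte[] FAM  = Bytes.toBytes("f");
  private static final byte[] PATH = Bytes.toBytes("path");
  private static final byte[] TS   = Bytes.toBytes("lastUpdate");

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    // Mirror every write to table 1 into the timestamp-keyed table, so
    // clients never touch table 2 directly.
    List<KeyValue> pathKvs = put.get(FAM, PATH);
    List<KeyValue> tsKvs   = put.get(FAM, TS);
    if (pathKvs.isEmpty() || tsKvs.isEmpty()) return;   // not a full file update

    HTableInterface byTime =
        ctx.getEnvironment().getTable(Bytes.toBytes("files_by_time"));
    try {
      Put mirror = new Put(tsKvs.get(0).getValue());    // row key = new timestamp
      mirror.add(FAM, put.getRow(), pathKvs.get(0).getValue());
      byTime.put(mirror);
      // Deleting the column under the file's previous timestamp (step 2 above)
      // is omitted here for brevity.
    } finally {
      byTime.close();
    }
  }
}

The observer would be loaded onto the "files" table via hbase-site.xml or the table descriptor, and clients would then only ever write to table 1.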

You may also want to look at Asynchronous HBase; however, I don't know how well it will work with coprocessors, or whether you want to perform async operations in this specific use case.

Good luck, HTH...

-Mike

On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:

> Hi Michael,
>
> For now this is more a proof of concept than a production application.
> And if it's working, it should grow a lot, and the database at the
> end will easily be over 1B rows. Each individual server will have to
> send its own information to one centralized server, which will insert
> it into a database. That's why it needs to be very quick and that's
> why I'm looking in HBase's direction. I tried with some relational
> databases with 4M rows in the table, but the insert time is too slow
> when I have to introduce entries in bulk. Also, the ability for HBase
> to keep only the cells with values will allow me to save a lot of
> disk space (future projects).
>
> I'm not yet used to HBase and there are still many things I need to
> understand, but until I'm able to create a solution and test it, I will
> continue to read, learn and try that way. Then at the end I will be
> able to compare the 2 options I have (HBase or relational) and decide
> based on the results.
>
> So yes, your reply helped because it's giving me a way to achieve this
> goal (using coprocessors). I don't know yet how this part works,
> so I will dig into the documentation for it.
>