Re: Timestamp as a key good practice?
JM, have a look at https://github.com/sematext/HBaseWD (this comes up often.... Doug, maybe you could add it to the Ref Guide?)

> From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>Sent: Wednesday, June 13, 2012 12:16 PM
>Subject: Timestamp as a key good practice?
>I watched Lars George's video about HBase and read the documentation.
>Both say that it's not a good idea to use the timestamp as a key,
>because writes will always load the same region until the timestamp
>reaches a certain value and moves to the next region (hotspotting).
>I have a table with a unique key, a file path and a "last update" field.
>I can easily find the file by its ID and find when it was last updated.
>But what I also need is to find the files not updated for more than a
>certain period of time.
>If I want to retrieve that from this single table, I will have to do a
>full scan of the table, which might take a while.
>So I thought of building a table to reference that (a kind of secondary
>index). The key is the "last update", with one CF, and each column will
>have the ID of the file with dummy content.
>When a file is updated, I remove its cell from this table and
>insert a new cell with the new timestamp as the key.
>And so on.
>With this schema, I can find the files by ID very quickly and I can
>find the files which need to be updated pretty quickly too. But it's
>hotspotting one region.
>From the video (0:45:10) I can see 4 situations.
>1) Hotspotting.
>2) Salting.
>3) Key field swap/promotion
>4) Randomization.
>I need to avoid hotspotting, so I looked at the 3 other options.
>I can do salting, like prefixing the timestamp with a number between 0
>and 9. That will distribute the load over 10 servers. To find all
>the files with a timestamp below a specific value, I will need to run
>10 requests instead of one. But when the load becomes too big for
>10 servers, will I have to prefix with a byte between 0 and 99? Which
>means 100 requests? And the more regions I have, the more requests
>I will have to do. Is that really a good approach?
>Key field swap is close to salting. I can add the first few bytes from
>the path before the timestamp, but the issue will remain the same.
>I looked at randomization, and I can't do that. Otherwise I will have no
>way to retrieve the information I'm looking for.
>So the question is: is there a good way to store the data so that I can
>retrieve it based on the date?
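
The salting approach described in the quoted message can be sketched roughly as follows. This is only an illustration in plain Python, not HBase client code and not how HBaseWD is implemented: the table is simulated by a sorted list of row keys, and the names `NUM_BUCKETS`, `salt_for`, `index_key`, `decode_key` and `files_older_than` are all hypothetical. It assumes the salt is derived from a hash of the file ID (so the old index row can be recomputed and deleted when a file is updated) and that timestamps are non-negative integers that fit in 8 bytes.

```python
import bisect
import heapq
import zlib

# Assumption: 10 salt buckets, matching the "0 to 9" prefix in the question.
NUM_BUCKETS = 10


def salt_for(file_id, buckets=NUM_BUCKETS):
    """Deterministic salt derived from the file ID (hypothetical choice)."""
    return zlib.crc32(file_id.encode("utf-8")) % buckets


def index_key(timestamp_ms, file_id, buckets=NUM_BUCKETS):
    """Row key layout: 1 salt byte + 8-byte big-endian timestamp + file ID."""
    return (
        bytes([salt_for(file_id, buckets)])
        + timestamp_ms.to_bytes(8, "big")
        + file_id.encode("utf-8")
    )


def decode_key(key):
    """Split a row key back into (timestamp_ms, file_id)."""
    return int.from_bytes(key[1:9], "big"), key[9:].decode("utf-8")


def files_older_than(sorted_keys, cutoff_ms, buckets=NUM_BUCKETS):
    """Run one range scan per salt bucket and merge the results by timestamp.

    sorted_keys stands in for the index table: row keys in byte order,
    which is also how HBase orders rows.
    """
    per_bucket = []
    for salt in range(buckets):
        start = bytes([salt])                                # first key of the bucket
        stop = bytes([salt]) + cutoff_ms.to_bytes(8, "big")  # exclusive upper bound
        lo = bisect.bisect_left(sorted_keys, start)
        hi = bisect.bisect_left(sorted_keys, stop)
        per_bucket.append([decode_key(k) for k in sorted_keys[lo:hi]])
    # Each bucket is already ordered by timestamp, so a k-way merge suffices.
    return list(heapq.merge(*per_bucket))
```

With this layout the number of range scans equals the number of buckets, which is exactly the trade-off raised in the question: more buckets spread writes over more regions but cost more scans per query. The per-bucket scans are independent, so a real client could issue them in parallel.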