HBase >> mail # user >> Timestamp as a key good practice?


Re: Timestamp as a key good practice?
JM, have a look at https://github.com/sematext/HBaseWD (this comes up often... Doug, maybe you could add it to the Ref Guide?)

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 

>________________________________
> From: Jean-Marc Spaggiari <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Wednesday, June 13, 2012 12:16 PM
>Subject: Timestamp as a key good practice?
>
>I watched Lars George's video about HBase and read the documentation,
>and both say that it's not a good idea to use the timestamp as a
>key because writes will always hit the same region until the timestamp
>reaches a certain value and moves to the next region (hotspotting).
>
>I have a table with a unique key, a file path and a "last update" field.
>I can easily look up a file by its ID and find when it was last
>updated.
>
>But I also need to find the files not updated for more than a
>certain period of time.
>
>If I want to retrieve that from this single table, I will have to do a
>full scan of the table, which might take a while.
>
>So I thought of building a table to reference that (a kind of secondary
>index). The key is the "last update", there is one column family, and
>each column will have the ID of the file with dummy content.
>
>When a file is updated, I remove its cell from this table, and
>insert a new cell with the new timestamp as the key.
>
>And so on.
>
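
A minimal in-memory sketch of the index maintenance described above, not HBase client code. The class and method names are hypothetical, and it assumes timestamps are non-negative integers encoded big-endian so that byte order matches time order.

```python
import struct
from bisect import bisect_left, insort


class LastUpdateIndex:
    """In-memory stand-in for the main table plus the index table (hypothetical)."""

    def __init__(self):
        self.files = {}       # main table: file_id -> (path, last_update)
        self.index_keys = []  # sorted index row keys (encoded timestamps)
        self.index_rows = {}  # index row key -> set of file ids (one "column" each)

    @staticmethod
    def row_key(ts):
        # Big-endian unsigned 64-bit: bytes sort in the same order as the numbers.
        return struct.pack(">Q", ts)

    def put(self, file_id, path, ts):
        old = self.files.get(file_id)
        if old is not None:
            # Remove the cell stored under the previous timestamp.
            self._remove_cell(file_id, old[1])
        self.files[file_id] = (path, ts)
        key = self.row_key(ts)
        if key not in self.index_rows:
            insort(self.index_keys, key)
            self.index_rows[key] = set()
        self.index_rows[key].add(file_id)

    def _remove_cell(self, file_id, ts):
        key = self.row_key(ts)
        ids = self.index_rows[key]
        ids.discard(file_id)
        if not ids:
            # Drop the row once its last cell is gone.
            del self.index_rows[key]
            self.index_keys.pop(bisect_left(self.index_keys, key))

    def not_updated_since(self, cutoff_ts):
        # Range scan over all index rows whose timestamp is below the cutoff.
        stop = bisect_left(self.index_keys, self.row_key(cutoff_ts))
        result = []
        for key in self.index_keys[:stop]:
            result.extend(sorted(self.index_rows[key]))
        return result
```

Because the index keys are sorted by time, every new write lands at the end of the key space, which is the hotspotting concern raised in this message.
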
>With this schema, I can find the files by ID very quickly and I can
>find the files which need to be updated pretty quickly too. But it's
>hotspotting one region.
>
>From the video (0:45:10) I can see 4 situations.
>1) Hotspotting.
>2) Salting.
>3) Key field swap/promotion
>4) Randomization.
>
>I need to avoid hotspotting, so I looked at the 3 other options.
>
>I can do salting, for example by prefixing the timestamp with a number
>between 0 and 9. That will distribute the load over 10 servers. To find
>all the files with a timestamp below a specific value, I will need to run
>10 requests instead of one. But when the load becomes too big for
>10 servers, I will have to prefix with a number between 0 and 99? Which
>means 100 requests? And the more regions I have, the more requests
>I will have to do. Is that really a good approach?
>
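
A rough sketch of the salting trade-off described above, in plain Python rather than HBase client code. All function names are hypothetical. It assumes the salt is derived from the file ID with CRC32 (so the old cell can be located again on update) and that timestamps are non-negative integers.

```python
import struct
import zlib
from bisect import bisect_left

NUM_BUCKETS = 10  # assumption: mirrors the 0-9 prefix from the message


def salt_for(file_id, num_buckets=NUM_BUCKETS):
    # Deterministic salt: the same file always maps to the same bucket.
    return zlib.crc32(file_id.encode("utf-8")) % num_buckets


def salted_key(file_id, ts, num_buckets=NUM_BUCKETS):
    # One salt byte followed by the big-endian timestamp.
    return bytes([salt_for(file_id, num_buckets)]) + struct.pack(">Q", ts)


def scan_ranges(cutoff_ts, num_buckets=NUM_BUCKETS):
    # One (start, stop) range per bucket; stop is exclusive.
    return [
        (bytes([b]) + struct.pack(">Q", 0), bytes([b]) + struct.pack(">Q", cutoff_ts))
        for b in range(num_buckets)
    ]


def scan(sorted_keys, start, stop):
    # Stand-in for a range scan over a table sorted by row key.
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]


def older_than(sorted_keys, cutoff_ts, num_buckets=NUM_BUCKETS):
    # A time-range query fans out into one scan per salt bucket.
    hits = []
    for start, stop in scan_ranges(cutoff_ts, num_buckets):
        hits.extend(scan(sorted_keys, start, stop))
    return hits
```

In this sketch the number of scans per query equals the number of salt buckets, which is exactly the cost the question is about: going from 10 to 100 buckets turns 10 range scans into 100.
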
>Key field swap is close to salting. I can add the first few bytes from
>the path before the timestamp, but the issue will remain the same.
>
>I looked at randomization, and I can't do that. Otherwise I will have no
>way to retrieve the information I'm looking for.
>
>So the question is: is there a good way to store the data so that I can
>retrieve it based on the date?
>
>Thanks,
>
>JM
>
>
>