HBase >> mail # user >> key design

Re: key design

Why separate tables per log type? Why not a single table with the key:

<log type><date>

That's roughly the approach used by OpenTSDB (with "metric id" instead of "log type", but same idea). OpenTSDB goes further by "bucketing" values into rows using a base timestamp in the row key and offset timestamps in the column qualifiers, for more efficiency.
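
A rough sketch of that bucketing idea, in Python for brevity. This is not OpenTSDB's actual byte layout: the one-hour bucket, the 2-byte log type id, the field widths, and the name `bucketed_cell` are all assumptions made for illustration.

```python
import struct
from typing import Tuple

BUCKET_SECONDS = 3600  # assumed bucket width: one row per log type per hour


def bucketed_cell(log_type_id: int, epoch_seconds: int) -> Tuple[bytes, bytes]:
    """Return (row_key, column_qualifier) for one log entry.

    Hypothetical layout:
      row key   = 2-byte log type id + 4-byte base timestamp (start of the hour)
      qualifier = 2-byte offset in seconds from the base timestamp
    """
    base = epoch_seconds - (epoch_seconds % BUCKET_SECONDS)
    offset = epoch_seconds - base
    # Big-endian packing keeps byte order equal to numeric order, so rows
    # for one log type sort chronologically.
    row_key = struct.pack(">HI", log_type_id, base)
    qualifier = struct.pack(">H", offset)
    return row_key, qualifier
```

Every entry of one log type within the same hour lands in the same row under a different qualifier, so there are far fewer rows while the keys still sort by time within a log type.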

If you start the key with log type, you can do partial scans for a specific date, but only within a single log type; to scan across all log types, you'd need to do multiple scans (one per log type). If you have a fixed and relatively small number of log types (fewer than 20, say), this could still be the best approach. But if scanning by time across all log types is a very frequent operation and you have a lot of log types, you might want to reconsider that.
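
To make the "one scan per log type" point concrete, here is a minimal sketch that computes the start and stop rows for a one-day scan over every log type. The `|` separator, the `YYYYMMDD` date format, and the name `scan_ranges_for_day` are assumptions; it only builds the key ranges and does not talk to HBase.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def scan_ranges_for_day(log_types: Iterable[str], day: date) -> List[Tuple[bytes, bytes]]:
    """Return one (start_row, stop_row) pair per log type for the given day.

    Assumes row keys of the form b"<log type>|<YYYYMMDD>..." and that the
    stop row is exclusive, as it is for an HBase scan.
    """
    start_suffix = day.strftime("%Y%m%d")
    stop_suffix = (day + timedelta(days=1)).strftime("%Y%m%d")
    ranges = []
    for log_type in sorted(log_types):
        start_row = f"{log_type}|{start_suffix}".encode()
        stop_row = f"{log_type}|{stop_suffix}".encode()
        ranges.append((start_row, stop_row))
    return ranges
```

With a handful of log types this is a handful of cheap scans; with hundreds, each cross-type time query fans out into hundreds of scans, which is the case where the layout deserves a second look.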

The case for using a hash as the start of the key is really just to avoid region server "hot spotting": even though you have lots of machines, all your insert traffic goes to one of them, because all inserts are happening "now" and only one region server holds the range that "now" falls in. Salting or hashing a timestamp-based key spreads the load evenly, but it prevents you from doing linear scans over the time dimension. That's why OpenTSDB (and similar models) start the key with another value that "spreads" the data over all servers.
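
On the question of random versus meaningful salts, one common pattern (sketched here under assumptions, not something prescribed above) is to derive the salt from a hash of the rest of the key, modulo a fixed number of buckets. The bucket count of 16, the CRC32 hash, the one-byte salt, and the function names are all illustrative choices.

```python
import zlib
from typing import List, Tuple

NUM_BUCKETS = 16  # assumed; roughly the number of regions to spread writes over


def salted_key(timestamp: str) -> bytes:
    """Prefix a timestamp-based key with a one-byte salt derived from its hash.

    Because the salt is computed from the key itself, the full row key can be
    rebuilt later from the timestamp alone, unlike with a random UUID.
    """
    body = timestamp.encode()
    salt = zlib.crc32(body) % NUM_BUCKETS
    return bytes([salt]) + body


def salted_scan_ranges(start_ts: str, stop_ts: str) -> List[Tuple[bytes, bytes]]:
    """Return one (start_row, stop_row) pair per salt bucket for a time range."""
    return [
        (bytes([bucket]) + start_ts.encode(), bytes([bucket]) + stop_ts.encode())
        for bucket in range(NUM_BUCKETS)
    ]
```

Writes for consecutive timestamps scatter across the buckets, but a time-range read is no longer one linear scan: it becomes one scan per bucket, with the results merged on the client.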


On May 21, 2012, at 7:56 AM, mete wrote:

> Hello folks,
> i am trying to come up with a nice key design for storing logs in the
> company. I am planning to index them  and store row key in the index for
> random reads.
> I need to balance the writes equally between the R.S. and i could not
> understand how opentsdb does that with prefixing the metric id. (i related
> metric id with the log type) In my log storage case a log line just has a
> type and a date and the rest of it is not really very useful information.
> So i think that i can create a table for every distinct log type and i need
> a random salt to route to a different R.S. similar to this:
> <salt>-<date>
> But with this approach i believe i will lose the ability to do effective
> partial scans to a specific date. (if for some reason i need that) What do
> you think? And for the salt approach do you use randomly generated salts or
> hashes that actually mean something? (like the hash of the date)
> I am using random uuids at the moment but i am trying to find a better
> approach, any feedback is welcome
> cheers
> Mete