Re: crafting your key - scan vs. get
Neil,
Since you asked....
Actually, your question is kind of a boring one. ;-) [Note: I will probably get flamed for saying it, even if it is the truth!]

Having said that...
Boring as it is, it's an important topic that many still seem to trivialize in terms of its impact on performance.

Before answering your question, let's take a step back and ask a more important question...
"What data do you want to capture and store in HBase?"
and then ask yourself...
"How do I plan on accessing the data?"

From what I can tell, you want to track certain events made by a user.
So you're recording that at time X, user A did something.

Then the question is how do you want to access the data.

Do you primarily say "Show me all the events in the past 15 minutes and organize them by user?"
Or do you say "Show me the most recent events by user A" ?

Here's the issue.

If you are more interested in, and will frequently ask, the question "Show me the most recent events by user A", then you would want to do the following:
Key = User ID (hashed if necessary)
Column Family: Data (For lack of a better name)

Then store each event in a separate column, where the column name is something like "event" + (Long.MAX_VALUE - timestamp).

This will place the most recent event first.

The reason I say "event" + the long is that you may want to place user-specific information in its own columns, and you would want to make sure those sort in front of the event data.
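
To make that concrete, here's a rough sketch using the Java client API of the day. The table name "mytable", the family "data", and the sample values are placeholders for illustration, not anything from your post:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // Row key is the user ID (hash it first if necessary).
    byte[] userRow = Bytes.toBytes("AAAAAA");
    long eventTs = System.currentTimeMillis();

    // "event" + (Long.MAX_VALUE - timestamp): within the row, the newest
    // event sorts first, and all events sort after user-specific columns.
    byte[] qualifier = Bytes.add(Bytes.toBytes("event"),
                                 Bytes.toBytes(Long.MAX_VALUE - eventTs));

    Put put = new Put(userRow);
    put.add(Bytes.toBytes("data"), qualifier, Bytes.toBytes("myval1"));
    table.put(put);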

Now, if your access pattern is more along the lines of "Show me the events that occurred in the past 15 minutes," then you would use the time stamp in the key, and you would have to worry about hot spotting and region splits. But then you could get your data from a simple start/stop row scan.

In the first case, you can use get(); while a get() is internally still a scan, it's a very efficient fetch.
In the second, you will always need to do a scan.
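
Continuing the sketch above, the two access paths look roughly like this (the extra Get, Result, Scan, and ResultScanner imports come from org.apache.hadoop.hbase.client; names are again made up):

    // Case 1: all events for user A in a single get(); columns come back
    // sorted, so the most recent event is first.
    Get get = new Get(Bytes.toBytes("AAAAAA"));
    get.addFamily(Bytes.toBytes("data"));
    Result result = table.get(get);

    // Case 2: time-stamp-keyed rows; fetch the last 15 minutes of events
    // with a start/stop row scan.
    long now = System.currentTimeMillis();
    Scan scan = new Scan(Bytes.toBytes(now - 15 * 60 * 1000L),  // start row (inclusive)
                         Bytes.toBytes(now));                   // stop row (exclusive)
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
        // process one event row
    }
    scanner.close();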

Having said that, there are other things to think about including frequency and how wide your rows will get over time.
(Mainly in terms of the first example I gave.)

The reason I said that your question is boring is that it's been asked numerous times, and every time it's asked, the initial question doesn't provide enough information to actually give a good answer...

HTH

-Mike

On Oct 16, 2012, at 4:53 PM, Neil Yalowitz <[EMAIL PROTECTED]> wrote:

> Hopefully this is a fun question.  :)
>
> Assume you could architect an HBase table from scratch and you were
> choosing between the following two key structures.
>
> 1)
>
> The first structure creates a unique row key for each PUT.  The rows are
> events related to a user ID.  There may be up to several hundred events for
> each user ID (probably not thousands, an average of perhaps ~100 events per
> user).  Each key would be made unique with a reverse-order-timestamp or
> perhaps just random characters (we don't particularly care about using ROT
> for sorting newest here).
>
> key
> ----
> AAAAAA + some-unique-chars
>
> The table will look like this:
>
> key                value (cf:mycf)    ts
> ------------------------------------------------
> AAAAAA9999...      myval1             1350345600
> AAAAAA8888...      myval2             1350259200
> AAAAAA7777...      myval3             1350172800
>
>
> Retrieving these values will use a Scan with startRow and stopRow.  In
> hbase shell, it would look like:
>
> hbase> scan 'mytable', {STARTROW => 'AAAAAA', STOPROW => 'AAAAAA_'}
>
>
> 2)
>
> The second structure choice uses only the user ID as the key and relies on
> row versions to store all the events.  For example:
>
> key                value (cf:mycf)    ts
> ------------------------------------------------
> AAAAAA             myval1             1350345600
> AAAAAA             myval2             1350259200
> AAAAAA             myval3             1350172800
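
For what it's worth, reading structure 2 back depends entirely on cell versions, so a fetch would look roughly like this in the Java client (KeyValue comes from org.apache.hadoop.hbase). This assumes 'mycf' was created to retain enough versions, since a family only keeps a few by default:

    // Structure 2 sketch: one row per user, each event is a cell version.
    // Assumes something like: create 'mytable', {NAME => 'mycf', VERSIONS => 1000}
    Get get = new Get(Bytes.toBytes("AAAAAA"));
    get.setMaxVersions();              // return every retained version, not just the latest
    Result result = table.get(get);
    for (KeyValue kv : result.raw()) {
        // one event per version: kv.getTimestamp() is the ts, kv.getValue() the payload
    }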