Re: crafting your key - scan vs. get
Michael Segel 2012-10-18, 08:23
I've pointed you in the right direction.
The rest of the exercise is left to the student. :-)
You made a comment about having fun, but your question itself is boring. *^1
The fun part now is for you to play and see why I suggested that column order matters.
Sorry, but that really is the fun part of your question: figuring out the rest of the answer on your own.
From your response, you clearly understand it, but you need to spend more time wrapping your head around the solution and taking ownership of it.
*^1 The reason I say that the question is boring is that once you fully understand the problem and the solution, you can easily apply it to other problems. The fun is in actually taking the time to experiment and work through the problem on your own. Seriously, that *is* the fun part.
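To make the column-ordering point concrete, here is a minimal sketch of the reversed-timestamp trick discussed in this thread. The zero-padded decimal String encoding is an assumption for illustration; a big-endian byte encoding of the reversed long works the same way, since HBase sorts qualifiers as unsigned bytes.

```java
// Sketch (assumed encoding, not from the thread): qualifier = "event" +
// zero-padded (Long.MAX_VALUE - timestamp). Because HBase sorts column
// qualifiers lexicographically as bytes, a newer event's qualifier sorts
// *before* an older one's, so the most recent event comes first.
public class ReversedTsQualifier {
    static String qualifier(long ts) {
        // Zero-pad to 19 digits so lexicographic order matches numeric order.
        return String.format("event%019d", Long.MAX_VALUE - ts);
    }

    public static void main(String[] args) {
        long older = 1350420704490L;
        long newer = 1350420705459L;
        // The newer event's qualifier compares less than the older one's,
        // so it sorts first in the row.
        System.out.println(qualifier(newer).compareTo(qualifier(older)) < 0);
    }
}
```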
On Oct 17, 2012, at 10:53 PM, Neil Yalowitz <[EMAIL PROTECTED]> wrote:
> This is a helpful response, thanks. Our use case fits the "Show me the
> most recent events by user A" you described.
> So using the first example, a table populated with events of user ID AAAAAA.
> ROW     COLUMN+CELL
> AAAAAA  column=data:event9999, timestamp=1350420705459, value=myeventval1
> AAAAAA  column=data:event9998, timestamp=1350420704490, value=myeventval2
> AAAAAA  column=data:event9997, timestamp=1350420704567, value=myeventval3
> NOTE1: I replaced the TS stuff with 9999...9997 for brevity, and the
> example user ID "AAAAAA" would actually be hashed to avoid hotspotting
> NOTE2: I assume I should shorten the chosen column family and qualifier
> before writing it to a large production table (for instance, d instead of
> data and e instead of event)
> I hope I have that right. Thanks for the response!
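NOTE1's "hashed to avoid hotspotting" can be sketched as follows. The two-character MD5 prefix salt is one assumed reading of that note, not something specified in the thread; the point is just that a hash-derived prefix spreads otherwise-sequential user IDs across regions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedKeySketch {
    // Assumed scheme: prefix the row key with the first byte of an MD5 hash
    // of the user ID, so writes for adjacent IDs land on different regions
    // instead of hotspotting one region server.
    static String saltedKey(String userId) throws Exception {
        byte[] h = MessageDigest.getInstance("MD5")
                .digest(userId.getBytes(StandardCharsets.UTF_8));
        // h[0] & 0xff widens the byte to an unsigned int for clean hex.
        return String.format("%02x-%s", h[0] & 0xff, userId);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(saltedKey("AAAAAA"));
    }
}
```

The trade-off: a salted prefix spreads writes, but a pure scan over "all users" now has to touch every salt bucket.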
> As for including enough description for the question to be "not-boring,"
> I'm never quite sure when an email will grow so long that no one will read
> it. :) So to give more background: Each event is about 1KB of data. The
> frequency is highly variable... over any given period of time, some users
> may only log one event and no more, some users may log a few events (10 to
> 100), in some rare cases a user may log many events (1000+). The width of
> the column is some concern for the users with many events, but I'm thinking
> a few rare rows with 1KB x 1000+ width shouldn't kill us.
> If I may ask a couple of followup questions about your comments:
>> Then store each event in a separate column where the column name is
> something like "event" + (max Long - Time Stamp) .
>> This will place the most recent event first.
> Although I know row keys are sorted, I'm not sure what this means for a
> qualifier. The scan result can depend on what cf:qual is used? ...and
> that determines which column value is "first"? Is this related to using
> setMaxResultsPerColumnFamily(1)? (ie-- only return one column value, so
> sort on qualifier and return the first val found)
>> The reason I say "event" + the long, is that you may want to place user
> specific information in a column and you would want to make sure it was in
> front of the event data.
> Same question as above, I'm not sure what would place a column "in front."
> Am I missing something?
>> In the first case, you can use get(); while it's still a scan under the covers, it's a
> very efficient fetch.
>> In the second, you will always need to do a scan.
> This is the core of my original question. My anecdotal tests in hbase
> shell showed a Get executing about 3x faster than a Scan with
> start/stoprow, but I don't trust my crude testing much and hoped someone
> could describe the performance trade-off between Scan vs. Get.
> Thanks again for anyone who read this far.
> Neil Yalowitz
> [EMAIL PROTECTED]
> On Wed, Oct 17, 2012 at 10:45 AM, Michael Segel
> <[EMAIL PROTECTED]>wrote:
>> Since you asked....
>> Actually your question is kind of a boring question. ;-) [Note I will