HBase, mail # user - Distinct counters and counting rows


RE: Distinct counters and counting rows
Ramkrishna.S.Vasudevan 2012-05-30, 07:35
To answer this question:
> Alternatively, is there a way to trigger an increment in another table (say
> "count") whenever a row was added to "user"?

You can try using coprocessors here. Once a Put is done on the table
'user', a coprocessor hook can trigger an Increment operation on the
table 'count'.
From the client's side this is still a single call. Also, the Increment
operation is atomic.
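
To make the idea concrete, here is a minimal in-memory sketch of the logic
in plain Java. It does not use the HBase API: the class name, the put() and
postPut() methods, and the two fields standing in for the 'user' and 'count'
tables are all hypothetical. One assumption worth stating: a post-put hook
would presumably fire on every Put, including Puts for row keys that already
exist, so the sketch only increments when the row is new. In a real
coprocessor that check would have to be implemented separately.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory model only; none of these names come from HBase.
class DistinctCounterSketch {
    // Stand-in for the 'user' table: the row key is the user_id.
    private final ConcurrentHashMap<String, Boolean> userTable = new ConcurrentHashMap<>();
    // Stand-in for a single counter cell in the 'count' table.
    private final AtomicLong countTable = new AtomicLong();

    // Models one client call: the write plus the hook it triggers.
    // Returns true if the row did not exist before.
    public boolean put(String userId) {
        boolean isNewRow = userTable.putIfAbsent(userId, Boolean.TRUE) == null;
        postPut(isNewRow);
        return isNewRow;
    }

    // Hypothetical hook: increment atomically, but only for new rows,
    // so that repeated user_ids are not counted twice.
    private void postPut(boolean isNewRow) {
        if (isNewRow) {
            countTable.incrementAndGet();
        }
    }

    // Cheap read of the distinct count, without scanning the 'user' table.
    public long count() {
        return countTable.get();
    }

    public static void main(String[] args) {
        DistinctCounterSketch sketch = new DistinctCounterSketch();
        for (String id : new String[] {"u1", "u2", "u1", "u3", "u2"}) {
            sketch.put(id);
        }
        System.out.println(sketch.count()); // prints 3
    }
}
```

The point of the design is that the user interface reads one counter value
instead of counting rows, while the stream writer still issues a single
call per user_id.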

Hope this helps.

Regards
Ram
> -----Original Message-----
> From: David Koch [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, May 30, 2012 12:47 PM
> To: [EMAIL PROTECTED]
> Subject: Distinct counters and counting rows
>
> Hello,
>
> I am testing HBase for distinct counters - more concretely, counting
> unique users from a fairly large stream of user_ids. For some time to
> come the volume will be limited enough to use exact counting rather
> than approximation but already it's too big to hold the entire set of
> user_ids in memory.
>
> For now I am basically inserting all elements from the stream into a
> "user" table which uses "user_id" as the row key, so as to enforce the
> uniqueness constraint.
>
> My question:
> a) Is there a way to get a quick (i.e. with small delay in a user
> interface) count of the size of the user table to return the number of
> users? Alternatively, is there a way to trigger an increment in
> another table (say "count") whenever a row was added to "user"? I
> guess this can be picked up eventually by the client application but I
> don't want this to delay the actual stream processing.
> b) I heard about Bloom filters in HBase but failed to understand if
> they are used for row keys as well. Are they? How do I activate them? I
> was looking to reduce the work-load of checking set membership for
> every user_id in the stream. If this is done by HBase internally even
> better.
> c) Eventually, I want to store distinct users by day and then do
> unions on different days to get the total number of unique users for a
> multi-day period. Is this likely to involve a Map Reduce or is there a
> more "light-weight" approach?
>
> Thank you,
>
> /David