I am testing HBase for distinct counters - more concretely, counting
unique users from a fairly large stream of user_ids. For some time to
come the volume will be limited enough to use exact counting rather
than approximation but already it's too big to hold the entire set of
user_ids in memory.
For now I am basically inserting all elements from the stream into a
"user" table whose row key is the user_id, so as to enforce uniqueness.
a) Is there a way to get a quick (i.e. with only a small delay in a user
interface) count of the rows in the user table, to return the number of
users? Alternatively, is there a way to trigger an increment in
another table (say "count") whenever a new row is added to "user"? I
guess this can be picked up eventually by the client application but I
don't want this to delay the actual stream processing.
b) I heard about Bloom filters in HBase but failed to understand if
they are used for row keys as well. Are they? How do I activate them? I
was looking to reduce the work-load of checking set membership for
every user_id in the stream. If this is done by HBase internally, even
better.
c) Eventually, I want to store distinct users by day and then do
unions on different days to get the total number of unique users for a
multi-day period. Is this likely to involve a MapReduce job or is there a
more "light-weight" approach?
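
To make the semantics I am after concrete, here is a toy in-memory sketch in
Python. It is not HBase code and it ignores the memory problem entirely; the
class and method names (DailyDistinctUsers, add, count, count_union) are made
up for illustration. It only shows the behaviour I want from (a) and (c): a
counter that is bumped only when a user_id is seen for the first time on a
given day, and a multi-day count that is the size of the union of the daily
sets.

```python
from collections import defaultdict


class DailyDistinctUsers:
    """Toy stand-in for a per-day "user" table plus a "count" table."""

    def __init__(self):
        # day -> set of user_ids (plays the role of the row keys)
        self.users_by_day = defaultdict(set)
        # day -> counter, incremented only on the first sighting of a user_id
        self.count_by_day = defaultdict(int)

    def add(self, day, user_id):
        """Record user_id for day; return True if it was new for that day."""
        seen = self.users_by_day[day]
        if user_id in seen:
            return False
        seen.add(user_id)
        self.count_by_day[day] += 1
        return True

    def count(self, day):
        """Cheap read of the precomputed counter for one day."""
        return self.count_by_day.get(day, 0)

    def count_union(self, days):
        """Distinct users over several days: size of the union of daily sets."""
        union = set()
        for day in days:
            union |= self.users_by_day.get(day, set())
        return len(union)
```

The question is essentially which parts of this HBase can do for me: the
"is it new?" check in add (b), the cheap read in count (a), and the union in
count_union (c).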