HBase >> mail # user >> Distinct counters and counting rows


Distinct counters and counting rows
Hello,

I am testing HBase for distinct counters - more concretely, counting
unique users from a fairly large stream of user_ids. For some time to
come the volume will be limited enough to use exact counting rather
than approximation but already it's too big to hold the entire set of
user_ids in memory.

For now I am inserting every element from the stream into a "user"
table whose row key is the user_id, so that the row key itself
enforces uniqueness.
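
As a rough illustration of this setup, here is a minimal sketch in plain
Python. The `UserTable` class and the `ingest` helper are hypothetical
in-memory stand-ins, not the HBase client API; they only mimic the one
property relied on here, namely that writing an existing row key
overwrites the row instead of adding a new one.

```python
class UserTable:
    """Hypothetical in-memory stand-in for the "user" table (not HBase itself)."""

    def __init__(self):
        # Maps row key (user_id) to the last value written for that row.
        self._rows = {}

    def put(self, user_id, seen_at):
        # Writing an existing row key overwrites it, so duplicates in the
        # stream never produce a second row.
        self._rows[user_id] = seen_at

    def row_count(self):
        # A full count; on a real table this is the expensive part.
        return len(self._rows)


def ingest(table, stream):
    """Write every user_id from the stream and return the distinct count."""
    for seen_at, user_id in enumerate(stream):
        table.put(user_id, seen_at)
    return table.row_count()


# Made-up sample stream with repeated user_ids.
table = UserTable()
distinct_users = ingest(table, ["u1", "u2", "u1", "u3", "u2"])
print(distinct_users)
```

The sketch needs no membership check before the write, but the count
still requires looking at the whole table, which is what question a)
below is about.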

My question:
a) Is there a way to get a quick (i.e., with small delay in a user
interface) count of the size of the user table to return the number of
users? Alternatively, is there a way to trigger an increment in
another table (say "count") whenever a row is added to "user"? I
guess this can be picked up eventually by the client application, but I
don't want this to delay the actual stream processing.
b) I heard about Bloom filters in HBase but failed to understand
whether they are used for row keys as well. Are they? How do I
activate them? I am looking to reduce the workload of checking set
membership for every user_id in the stream. If HBase does this
internally, even better.
c) Eventually, I want to store distinct users by day and then take
unions over different days to get the total number of unique users for
a multi-day period. Is this likely to involve a MapReduce job, or is
there a more "light-weight" approach?
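
To make the goal in c) concrete, here is a minimal sketch in plain
Python of the per-day union, using in-memory sets rather than HBase.
The function names `group_by_day` and `distinct_over` and the sample
dates and user_ids are made up for illustration. It shows why the
per-day counts cannot simply be added: a user seen on several days must
be counted once.

```python
from collections import defaultdict


def group_by_day(events):
    """Collect the set of distinct user_ids seen on each day."""
    per_day = defaultdict(set)
    for day, user_id in events:
        per_day[day].add(user_id)
    return per_day


def distinct_over(per_day, days):
    """Count distinct users across the given days by taking a set union."""
    users = set()
    for day in days:
        # .get avoids creating empty entries for days without data.
        users |= per_day.get(day, set())
    return len(users)


# Made-up sample events: (day, user_id).
events = [
    ("2013-01-01", "u1"),
    ("2013-01-01", "u2"),
    ("2013-01-02", "u2"),
    ("2013-01-02", "u3"),
    ("2013-01-03", "u1"),
]
per_day = group_by_day(events)
two_day_total = distinct_over(per_day, ["2013-01-01", "2013-01-02"])
naive_sum = len(per_day["2013-01-01"]) + len(per_day["2013-01-02"])
print(two_day_total, naive_sum)
```

Here the union over the first two days is smaller than the sum of the
two daily counts because "u2" appears on both days. The sketch holds
every set in memory, which is exactly what will not scale in my case.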

Thank you,

/David