HBase >> mail # user >> Distinct counters and counting rows


David Koch 2012-05-30, 07:17
Ramkrishna.S.Vasudevan 2012-05-30, 07:35
Andrew Purtell 2012-05-30, 22:32
Andrew Purtell 2012-05-30, 22:39
Re: Distinct counters and counting rows
Hi David,

Have a look at coprocessors, which let you run custom code (Observers) on
get/put/delete operations on a table. You can easily implement the counters
with their help.
Here is an introduction to coprocessors:
https://blogs.apache.org/hbase/entry/coprocessor_introduction
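
A minimal sketch of such an Observer (an illustration only, not from the
thread: it assumes the 0.92/0.94-era coprocessor API, whose method signatures
changed in later releases, and the "count" table, "f" family and "total"
qualifier are made-up names):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Attached to the "user" table; bumps a cell in the "count" table after each put.
public class UserCountObserver extends BaseRegionObserver {

  private static final byte[] COUNT_TABLE = Bytes.toBytes("count");
  private static final byte[] FAMILY      = Bytes.toBytes("f");
  private static final byte[] TOTAL       = Bytes.toBytes("total");
  private static final byte[] ROW         = Bytes.toBytes("users");

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                      Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Note: postPut fires on every put, including re-writes of an existing
    // user_id, so this only counts distinct users if the client never writes
    // the same row key twice (e.g. it uses checkAndPut).
    HTableInterface countTable = ctx.getEnvironment().getTable(COUNT_TABLE);
    try {
      countTable.incrementColumnValue(ROW, FAMILY, TOTAL, 1L);
    } finally {
      countTable.close();
    }
  }
}

The observer would then be attached to the "user" table through the table
descriptor's coprocessor attribute (or loaded region-server-wide via
hbase-site.xml), as described in the blog post above.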

HTH,
Anil
On Wed, May 30, 2012 at 12:17 AM, David Koch <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I am testing HBase for distinct counters - more concretely, counting
> unique users from a fairly large stream of user_ids. For some time to
> come the volume will be limited enough to use exact counting rather
> than approximation, but it is already too big to hold the entire set
> of user_ids in memory.
>
> For now I am basically inserting all elements from the stream into a
> "user" table which uses "user_id" as the row key so as to enforce the
> uniqueness constraint.
>
> My question:
> a) Is there a way to get a quick (i.e. with a small enough delay for a
> user interface) count of the size of the "user" table to return the
> number of users? Alternatively, is there a way to trigger an increment
> in another table (say "count") whenever a row is added to "user"? I
> guess this can eventually be picked up by the client application, but
> I don't want it to delay the actual stream processing.
> b) I heard about Bloom filters in HBase but could not work out whether
> they are used for row keys as well. Are they? How do I activate them?
> I am looking to reduce the workload of checking set membership for
> every user_id in the stream. If this is done by HBase internally, even
> better.
> c) Eventually, I want to store distinct users by day and then take
> unions over different days to get the total number of unique users for
> a multi-day period. Is this likely to involve a MapReduce job, or is
> there a more "light-weight" approach?
>
> Thank you,
>
> /David
>
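
Regarding question (a) above: a minimal client-side sketch (not from the
thread; table, family, and qualifier names are illustrative, and it assumes
the 0.92/0.94-era client API) would insert each user_id with checkAndPut, so
the write only succeeds the first time a user_id is seen, and increment a
counter in a separate "count" table only in that case:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DistinctUserWriter {

  public static void main(String[] args) throws Exception {
    HTable users = new HTable(HBaseConfiguration.create(), "user");
    HTable count = new HTable(HBaseConfiguration.create(), "count");

    byte[] family    = Bytes.toBytes("f");
    byte[] qualifier = Bytes.toBytes("seen");
    byte[] row       = Bytes.toBytes("some_user_id");  // would come from the stream

    Put put = new Put(row);
    put.add(family, qualifier, Bytes.toBytes(1L));

    // checkAndPut with a null expected value only succeeds if the cell does
    // not exist yet, i.e. the first time this user_id is inserted.
    boolean firstTimeSeen = users.checkAndPut(row, family, qualifier, null, put);
    if (firstTimeSeen) {
      // keep a running total of distinct users in a single counter cell
      count.incrementColumnValue(Bytes.toBytes("users"), family,
          Bytes.toBytes("total"), 1L);
    }

    users.close();
    count.close();
  }
}

This keeps the count exact without scanning the "user" table, at the cost of
an extra round trip per incoming user_id, which may or may not be acceptable
for the stream-processing latency mentioned above.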

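Regarding question (b): Bloom filters in HBase are configured per column
family, and the ROW type does index row keys; whether it is enabled by
default depends on the HBase version. Note that it only helps the region
server skip store files during Gets; it does not replace an
application-level membership check. A hedged sketch of enabling it when
creating the "user" table (0.92/0.94-era admin API, illustrative names):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class CreateUserTable {

  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

    HColumnDescriptor family = new HColumnDescriptor("f");
    // ROW-level Bloom filter: lets reads skip store files that cannot
    // contain the requested row key
    family.setBloomFilterType(StoreFile.BloomType.ROW);

    HTableDescriptor table = new HTableDescriptor("user");
    table.addFamily(family);
    admin.createTable(table);
    admin.close();
  }
}

The same setting can also be applied to an existing table's column family
from the HBase shell with an alter command.
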
--
Thanks & Regards,
Anil Gupta