HBase >> mail # user >> Add client complexity or use a coprocessor?


Tom Brown 2012-04-09, 16:48
Andrew Purtell 2012-04-09, 18:28
Tom Brown 2012-04-10, 05:14
Andrew Purtell 2012-04-10, 18:01
Tom Brown 2012-04-10, 22:53
Andrew Purtell 2012-04-10, 23:53
Tom Brown 2012-04-11, 17:37
Andrew Purtell 2012-04-13, 22:32
kisalay 2012-04-11, 05:59
Tom Brown 2012-04-11, 17:41
kisalay 2012-04-11, 19:32
Jacques 2012-04-10, 06:01
Tom Brown 2012-04-10, 16:19
Re: Add client complexity or use a coprocessor?
On Tue, Apr 10, 2012 at 9:19 AM, Tom Brown <[EMAIL PROTECTED]> wrote:

> Jacques,
>
> The technique I've been trying to use is similar to a bloom filter
> (except that it's more space efficient).
Got it.  I didn't realize.
> It's my understanding that
> bloom filters in HBase are only implemented in the context of finding
> individual columns (for improving read performance). Are there
> specific bloom operations I can use atomically on a specific cell?
>

Your understanding is correct.  My statement was about using the data
structure as a compressed version of a duplication filter, not any HBase
feature.
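The duplication-filter idea discussed here can be sketched in plain Java, independent of any HBase feature. This is only an illustration of the data structure (the class and method names are invented, and a real version would keep the bits in the cell value rather than in client memory):

```java
import java.util.BitSet;

// Illustrative sketch of a bloom filter used as a duplication filter:
// "has this viewer already been counted for this cube cell?"
public class DuplicationFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public DuplicationFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive k bit positions from two base hashes (double hashing).
    private int bitFor(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Returns true if the key was possibly seen before; false means
    // definitely new. Sets the key's bits as a side effect, so the
    // caller increments its counter only when this returns false.
    public boolean checkAndMark(String key) {
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            int pos = bitFor(key, i);
            if (!bits.get(pos)) {
                seen = false;
                bits.set(pos);
            }
        }
        return seen;
    }
}
```

As with any bloom filter, false positives (events wrongly judged duplicates) are possible, which matches Tom's stated tolerance for a small error rate.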

> Thanks!
>
> --Tom
>
> On Tue, Apr 10, 2012 at 12:01 AM, Jacques <[EMAIL PROTECTED]> wrote:
> > What about maintaining a bloom filter in addition to an increment to
> > minimize double counting? You couldn't do atomic without some custom work
> > but it would get you mostly there.  If you wanted to be fancy you could
> > actually maintain the bloom as a bunch of separate columns to avoid update
> > contention.
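Jacques' suggestion of spreading the bloom across separate columns can be sketched as a simple layout function: each global bit position maps to a column qualifier plus an offset within that column, so concurrent updates usually touch different columns and contend less. The names here are invented for illustration:

```java
// Splits a bloom filter's bit positions across several column
// qualifiers to reduce update contention. Illustrative only.
public class ShardedBloomLayout {
    private final int bitsPerColumn;

    public ShardedBloomLayout(int bitsPerColumn) {
        this.bitsPerColumn = bitsPerColumn;
    }

    // Which column qualifier holds this global bit position.
    public String columnFor(int bitPos) {
        return "bloom_" + (bitPos / bitsPerColumn);
    }

    // Offset of the bit within that column's value.
    public int offsetFor(int bitPos) {
        return bitPos % bitsPerColumn;
    }
}
```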
> > On Apr 9, 2012 10:14 PM, "Tom Brown" <[EMAIL PROTECTED]> wrote:
> >
> >> Andy,
> >>
> >> I am a big fan of the Increment class. Unfortunately, I'm not doing
> >> simple increments for the viewer count. I will be receiving duplicate
> >> messages from a particular client for a specific cube cell, and don't
> >> want them to be counted twice (my stats don't have to be 100%
> >> accurate, but the expected rate of duplicates will be higher than the
> >> allowable error rate).
> >>
> >> I created an RPC endpoint coprocessor to perform this function but
> >> performance suffered heavily under load (it appears that the endpoint
> >> performs all functions in serial).
> >>
> >> When I tried implementing it as a region observer, I was unsure of how
> >> to correctly replace the provided "put" with my own. When I issued a
> >> put from within "prePut", the server blocked the new put (waiting for
> >> the "prePut" to finish). Should I be attempting to modify the WALEdit
> >> object?
> >>
> >> Is there a way to extend the functionality of "Increment" to provide
> >> arbitrary bitwise operations on the contents of a field?
> >>
> >> Thanks again!
> >>
> >> --Tom
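Absent a server-side bitwise Increment, one client-side fallback (a sketch, not something proposed in this thread) is an optimistic read-modify-write: read the current cell bytes, OR in the new bits, and write back conditioned on the value not having changed (e.g. via a checkAndPut-style compare-and-swap, retrying on failure). The merge step itself is just a byte-wise OR:

```java
// Byte-wise OR merge for a bloom-filter cell value. In a
// read-modify-write loop this would combine the current cell bytes
// with the bits contributed by a new event before writing back.
public class BitwiseMerge {
    public static byte[] or(byte[] current, byte[] update) {
        int len = Math.max(current.length, update.length);
        byte[] merged = new byte[len];
        for (int i = 0; i < len; i++) {
            byte a = i < current.length ? current[i] : 0;
            byte b = i < update.length ? update[i] : 0;
            merged[i] = (byte) (a | b);
        }
        return merged;
    }
}
```

The retry loop trades extra round trips under contention for correctness, which is exactly the cost the thread's coprocessor approaches are trying to avoid.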
> >>
> >> >If it helps, yes this is possible:
> >> >
> >> >> Can I observe updates to a
> >> >> particular table and replace the provided data with my own? (The
> >> >> client calls "put" with the actual user ID, my co-processor replaces
> >> >> it with a computed value, so the actual user ID never gets stored in
> >> >> HBase).
> >> >
> >> >Since your option #2 requires atomic updates to the data structure,
> >> >have you considered native atomic increments? See
> >> >atomic increments? See
> >> >
> >> >
> >> >http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
> >> >
> >> >
> >> >or
> >> >
> >> >
> >> >http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
> >> >
> >> >The former is a round trip for each value update. The latter allows
> >> >you to pack multiple updates into a single round trip. This would
> >> >give you accurate counts even with concurrent writers.
> >> >
> >> >It should be possible for you to do partial aggregation on the
> >> >client side too whenever parallel requests colocate multiple updates
> >> >to the same cube within some small window of time.
> >> >
> >> >Best regards,
> >> >
> >> >
> >> >    - Andy
> >> >
> >> >Problems worthy of attack prove their worth by hitting back.
> >> >- Piet Hein (via Tom White)
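The client-side partial aggregation Andy describes can be sketched without any HBase types: buffer deltas per cell key for a short window, then flush each row's merged deltas in one batch (in HBase, one Increment per row packs all of that row's columns into a single round trip). The class and method names below are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Buffers counter deltas per cell so that many updates to the same
// cube cell within a small time window collapse into one batched
// increment at flush time.
public class IncrementBuffer {
    private final Map<String, Long> deltas = new HashMap<>();

    // cellKey would typically be row + "/" + qualifier in a real client.
    public void add(String cellKey, long delta) {
        deltas.merge(cellKey, delta, Long::sum);
    }

    // At flush time each entry becomes one column update; entries for
    // the same row can share a single round trip.
    public Map<String, Long> drain() {
        Map<String, Long> snapshot = new HashMap<>(deltas);
        deltas.clear();
        return snapshot;
    }
}
```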
> >> >
> >> >----- Original Message -----
> >> >> From: Tom Brown <[EMAIL PROTECTED]>
> >> >> To: [EMAIL PROTECTED]
> >> >> Cc:
> >> >> Sent: Monday, April 9, 2012 9:48 AM
> >> >> Subject: Add client complexity or use a coprocessor?
> >> >>
> >> >> To whom it may concern,
> >> >>
> >> >> Ignoring the complexities of gathering the data, assume that I
> >> >> will be tracking millions of unique viewers. Updates from each of our