HBase >> mail # user >> Add client complexity or use a coprocessor?


Re: Add client complexity or use a coprocessor?
What about maintaining a bloom filter in addition to an increment to
minimize double counting? You couldn't make it atomic without some custom
work, but it would get you mostly there. If you wanted to be fancy you
could actually maintain the bloom as a bunch of separate columns to avoid
update contention.
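
For illustration, here is a rough sketch of that idea against the 0.92-era
client API. It is not atomic (the read-modify-write on the bloom shard can
race, per the caveat above), it uses a single hash where a real bloom filter
would set several bits, and the table, family, and sizing constants are all
made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ShardedBloomCounter {
  private static final byte[] F = Bytes.toBytes("f");
  private static final int SHARDS = 256;          // bitmap spread over f:b0 .. f:b255
  private static final int BITS_PER_SHARD = 8192;

  private final HTable table;

  public ShardedBloomCounter(Configuration conf) throws IOException {
    this.table = new HTable(conf, "viewer_cube"); // hypothetical table name
  }

  /** Marks userId seen for the cube cell at rowKey; true if it was (probably) new. */
  public boolean markSeen(byte[] rowKey, byte[] userId) throws IOException {
    int h = Bytes.hashCode(userId) & 0x7fffffff;
    byte[] shardCol = Bytes.toBytes("b" + (h % SHARDS)); // shard by hash to spread contention
    int bit = (h / SHARDS) % BITS_PER_SHARD;

    Get get = new Get(rowKey);
    get.addColumn(F, shardCol);
    Result r = table.get(get);
    byte[] bits = r.getValue(F, shardCol);
    if (bits == null) bits = new byte[BITS_PER_SHARD / 8];
    if ((bits[bit / 8] & (1 << (bit % 8))) != 0) {
      return false;                               // probably a duplicate; skip the count
    }
    bits[bit / 8] |= (1 << (bit % 8));            // racy read-modify-write, as noted above
    Put put = new Put(rowKey);
    put.add(F, shardCol, bits);
    table.put(put);
    table.incrementColumnValue(rowKey, F, Bytes.toBytes("count"), 1L);
    return true;
  }
}
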
On Apr 9, 2012 10:14 PM, "Tom Brown" <[EMAIL PROTECTED]> wrote:

> Andy,
>
> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice (my stats don't have to be 100%
> accurate, but the expected rate of duplicates will be higher than the
> allowable error rate).
>
> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions serially).
>
> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?
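>
> (For reference, here is a stripped-down sketch of what I'm guessing the
> observer should do instead: rewrite the KeyValues of the Put it is handed,
> rather than issue a second put that blocks on the row lock. Untested, and
> the family/qualifier names are placeholders:)
>
> import java.io.IOException;
> import java.util.List;
> import org.apache.hadoop.hbase.KeyValue;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
> import org.apache.hadoop.hbase.coprocessor.ObserverContext;
> import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
> import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class UserIdRewriter extends BaseRegionObserver {
>   private static final byte[] FAMILY = Bytes.toBytes("f");
>   private static final byte[] QUALIFIER = Bytes.toBytes("uid");
>
>   @Override
>   public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
>                      Put put, WALEdit edit, boolean writeToWAL)
>       throws IOException {
>     List<KeyValue> kvs = put.getFamilyMap().get(FAMILY);
>     if (kvs == null) return;
>     for (int i = 0; i < kvs.size(); i++) {
>       KeyValue kv = kvs.get(i);
>       if (Bytes.equals(kv.getQualifier(), QUALIFIER)) {
>         byte[] computed = hash(kv.getValue());  // stand-in for the real mapping
>         kvs.set(i, new KeyValue(kv.getRow(), FAMILY, QUALIFIER,
>                                 kv.getTimestamp(), computed));
>       }
>     }
>   }
>
>   private byte[] hash(byte[] userId) {
>     return Bytes.toBytes(Bytes.hashCode(userId)); // hypothetical transformation
>   }
> }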
>
> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?
>
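> (If not, my fallback is to do it client-side with a checkAndPut retry
> loop, along these lines; just a sketch, and it assumes fixed-length
> bitmaps:)
>
> import java.io.IOException;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
>
> public class BitwiseOr {
>   /** ORs mask into row/fam/qual, retrying until no concurrent writer intervenes. */
>   public static void orBits(HTable table, byte[] row, byte[] fam,
>                             byte[] qual, byte[] mask) throws IOException {
>     while (true) {
>       Get get = new Get(row);
>       get.addColumn(fam, qual);
>       byte[] old = table.get(get).getValue(fam, qual);
>
>       byte[] merged = mask.clone();
>       if (old != null) {
>         for (int i = 0; i < merged.length && i < old.length; i++) {
>           merged[i] |= old[i];
>         }
>       }
>       Put put = new Put(row);
>       put.add(fam, qual, merged);
>       // Succeeds only if the cell still holds 'old' (null means "absent"),
>       // i.e. optimistic concurrency without any server-side changes.
>       if (table.checkAndPut(row, fam, qual, old, put)) {
>         return;
>       }
>     }
>   }
> }
>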
> Thanks again!
>
> --Tom
>
> >If it helps, yes this is possible:
> >
> >> Can I observe updates to a
> >> particular table and replace the provided data with my own? (The
> >> client calls "put" with the actual user ID, my co-processor replaces
> >> it with a computed value, so the actual user ID never gets stored in
> >> HBase).
> >
> >Since your option #2 requires atomic updates to the data structure, have
> >you considered native atomic increments? See
> >
> >http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
> >
> >or
> >
> >http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
> >
> >The former is a round trip for each value update. The latter allows you to
> >pack multiple updates into a single round trip. This would give you accurate
> >counts even with concurrent writers.
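> >
> >For example (sketch only; the table, family, and qualifier names are
> >made up):
> >
> >import java.io.IOException;
> >import org.apache.hadoop.conf.Configuration;
> >import org.apache.hadoop.hbase.HBaseConfiguration;
> >import org.apache.hadoop.hbase.client.HTable;
> >import org.apache.hadoop.hbase.client.Increment;
> >import org.apache.hadoop.hbase.client.Result;
> >import org.apache.hadoop.hbase.util.Bytes;
> >
> >public class CounterExamples {
> >  public static void main(String[] args) throws IOException {
> >    Configuration conf = HBaseConfiguration.create();
> >    HTable table = new HTable(conf, "viewer_cube");
> >
> >    // One round trip per counter:
> >    long viewers = table.incrementColumnValue(Bytes.toBytes("cube-row"),
> >        Bytes.toBytes("f"), Bytes.toBytes("viewers"), 1L);
> >    System.out.println("viewers=" + viewers);
> >
> >    // Several counters on the same row in a single round trip:
> >    Increment inc = new Increment(Bytes.toBytes("cube-row"));
> >    inc.addColumn(Bytes.toBytes("f"), Bytes.toBytes("viewers"), 1L);
> >    inc.addColumn(Bytes.toBytes("f"), Bytes.toBytes("minutes"), 37L);
> >    Result result = table.increment(inc);
> >  }
> >}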
> >
> >It should be possible for you to do partial aggregation on the client side
> >too whenever parallel requests colocate multiple updates to the same cube
> >within some small window of time.
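> >
> >Something like this hypothetical buffer, flushed every few hundred
> >milliseconds, would collapse many events for the same cube cell into one
> >Increment per row (sketch only):
> >
> >import java.io.IOException;
> >import java.util.HashMap;
> >import java.util.Map;
> >import org.apache.hadoop.hbase.client.HTable;
> >import org.apache.hadoop.hbase.client.Increment;
> >import org.apache.hadoop.hbase.util.Bytes;
> >
> >public class CubeBuffer {
> >  private final HTable table;
> >  private final Map<String, Long> pending = new HashMap<String, Long>();
> >
> >  public CubeBuffer(HTable table) { this.table = table; }
> >
> >  public void add(String cubeRow, long delta) {
> >    Long current = pending.get(cubeRow);
> >    pending.put(cubeRow, (current == null ? 0L : current) + delta);
> >  }
> >
> >  /** One RPC per distinct row touched in the window, not one per event. */
> >  public void flush() throws IOException {
> >    for (Map.Entry<String, Long> entry : pending.entrySet()) {
> >      Increment inc = new Increment(Bytes.toBytes(entry.getKey()));
> >      inc.addColumn(Bytes.toBytes("f"), Bytes.toBytes("viewers"),
> >          entry.getValue());
> >      table.increment(inc);
> >    }
> >    pending.clear();
> >  }
> >}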
> >
> >Best regards,
> >
> >
> >    - Andy
> >
> >Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >(via Tom White)
> >
> >----- Original Message -----
> >> From: Tom Brown <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Cc:
> >> Sent: Monday, April 9, 2012 9:48 AM
> >> Subject: Add client complexity or use a coprocessor?
> >>
> >> To whom it may concern,
> >>
> >> Ignoring the complexities of gathering the data, assume that I will be
> >> tracking millions of unique viewers. Updates from each of our millions
> >> of clients are gathered in a centralized platform and spread among a
> >> group of machines for processing and inserting into HBase (assume that
> >> this group can be scaled horizontally). The data is stored in an OLAP
> >> cube format and one of the metrics I'm tracking across various
> >> attributes is viewership (how many people from Y are watching X).
> >>
> >> I'm writing this to ask for your thoughts as to the most appropriate
> >> way to structure my data so I can count unique TV viewers (assume a
> >> service like Netflix or Hulu).
> >>
> >> Here are the solutions I'm considering:
> >>
> >> 1. Store each unique user ID as the cell name within the cube(s) in
> >> which it occurs. This has the advantage of 100% accuracy, but the
> >> downside is the enormous space required to store each unique cell.
> >> Consuming this data is also problematic as the only way to provide a
> >> viewership count is by counting each cell. To save the overhead of