Re: Add client complexity or use a coprocessor?
What about maintaining a bloom filter in addition to the increment to
minimize double counting? You couldn't make it atomic without some custom
work, but it would get you mostly there. If you wanted to be fancy you
could maintain the bloom filter as a bunch of separate columns to avoid
update contention.
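
A rough sketch of the separate-columns idea (not from the thread; the
table layout, column names, and hash scheme below are made up, and the
read-then-set is not atomic, which is why it only gets you "mostly
there"):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomCountSketch {
    private static final byte[] BLOOM_CF = Bytes.toBytes("bf"); // one column per bloom slot
    private static final byte[] METRIC_CF = Bytes.toBytes("m");
    private static final byte[] VIEWERS = Bytes.toBytes("viewers");
    private static final int NUM_SLOTS = 1 << 16;
    private static final int NUM_HASHES = 3;

    // Count a view for (cubeRow, userId), skipping the increment when the
    // bloom columns say this user has probably been counted already.
    public void recordView(HTable table, byte[] cubeRow, String userId)
            throws Exception {
        byte[][] slots = new byte[NUM_HASHES][];
        for (int i = 0; i < NUM_HASHES; i++) {
            int h = ((userId + "#" + i).hashCode() & 0x7fffffff) % NUM_SLOTS;
            slots[i] = Bytes.toBytes(h);
        }

        // Fetch only the bloom columns this user would have set.
        Get get = new Get(cubeRow);
        for (byte[] slot : slots) {
            get.addColumn(BLOOM_CF, slot);
        }
        Result existing = table.get(get);

        boolean probablySeen = true;
        for (byte[] slot : slots) {
            if (!existing.containsColumn(BLOOM_CF, slot)) {
                probablySeen = false;
                break;
            }
        }
        if (probablySeen) {
            return; // likely duplicate; don't count twice
        }

        // Set the bloom bits, then bump the counter.
        Put put = new Put(cubeRow);
        for (byte[] slot : slots) {
            put.add(BLOOM_CF, slot, Bytes.toBytes(true));
        }
        table.put(put);
        table.incrementColumnValue(cubeRow, METRIC_CF, VIEWERS, 1L);
    }
}

Spreading the bits over many qualifiers means concurrent writers usually
touch different columns instead of contending on a single serialized value.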
On Apr 9, 2012 10:14 PM, "Tom Brown" <[EMAIL PROTECTED]> wrote:

> Andy,
>
> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice (my stats don't have to be 100%
> accurate, but the expected rate of duplicates will be higher than the
> allowable error rate).
>
> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions in serial).
>
> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?
>
> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?
>
> Thanks again!
>
> --Tom
>
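One possible direction for the prePut question above, purely as a sketch
(0.92-style RegionObserver signature; the column family and the ID
transform are placeholders): mutate the Put handed to the hook instead of
issuing a second put from inside it, so the write path is not re-entered.

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class UserIdRewriteObserver extends BaseRegionObserver {
    private static final byte[] CF = Bytes.toBytes("d"); // placeholder family

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
                       Put put, WALEdit edit, boolean writeToWAL)
            throws IOException {
        // Rewrite the incoming Put in place rather than calling put()
        // again from inside the hook.
        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        List<KeyValue> kvs = familyMap.get(CF);
        if (kvs == null) {
            return;
        }
        for (int i = 0; i < kvs.size(); i++) {
            KeyValue kv = kvs.get(i);
            // Placeholder transform: replace the raw user ID with a hash.
            byte[] computed =
                Bytes.toBytes(Bytes.toString(kv.getValue()).hashCode());
            kvs.set(i, new KeyValue(put.getRow(), CF, kv.getQualifier(),
                                    kv.getTimestamp(), computed));
        }
    }
}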
> >If it helps, yes this is possible:
> >
> >> Can I observe updates to a
> >> particular table and replace the provided data with my own? (The
> >> client calls "put" with the actual user ID, my co-processor replaces
> >> it with a computed value, so the actual user ID never gets stored in
> >> HBase).
> >
> >Since your option #2 requires atomic updates to the data structure,
> >have you considered native atomic increments? See
> >
> >http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
> >
> >
> >or
> >
> >
> >http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
> >
> >The former is a round trip for each value update. The latter allows
> >you to pack multiple updates into a single round trip. This would give
> >you accurate counts even with concurrent writers.
> >
> >It should be possible for you to do partial aggregation on the client
> >side too whenever parallel requests colocate multiple updates to the
> >same cube within some small window of time.
> >
> >Best regards,
> >
> >
> >    - Andy
> >
> >Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >(via Tom White)
> >
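For reference, packing several counter updates into one round trip with
the Increment class described above looks roughly like this (table name,
row key, and qualifiers are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedIncrementExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "cube");
        // All of these updates ride to the region server in one RPC and
        // are applied atomically within the row.
        Increment inc = new Increment(Bytes.toBytes("showX|regionY|2012040920"));
        inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("views"), 1L);
        inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("minutes"), 3L);
        table.increment(inc);
        table.close();
    }
}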
> >----- Original Message -----
> >> From: Tom Brown <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Cc:
> >> Sent: Monday, April 9, 2012 9:48 AM
> >> Subject: Add client complexity or use a coprocessor?
> >>
> >> To whom it may concern,
> >>
> >> Ignoring the complexities of gathering the data, assume that I will be
> >> tracking millions of unique viewers. Updates from each of our millions
> >> of clients are gathered in a centralized platform and spread among a
> >> group of machines for processing and inserting into HBase (assume that
> >> this group can be scaled horizontally). The data is stored in an OLAP
> >> cube format and one of the metrics I'm tracking across various
> >> attributes is viewership (how many people from Y are watching X).
> >>
> >> I'm writing this to ask for your thoughts as to the most appropriate
> >> way to structure my data so I can count unique TV viewers (assume a
> >> service like Netflix or Hulu).
> >>
> >> Here are the solutions I'm considering:
> >>
> >> 1. Store each unique user ID as the cell name within the cube(s) where
> >> it occurs. This has the advantage of having 100% accuracy, but the
> >> downside is the enormous space required to store each unique cell.
> >> Consuming this data is also problematic as the only way to provide a
> >> viewership count is by counting each cell. To save the overhead of