HyperLogLog Approximate Distinct Counting as a Hive UDAF
Nick Pentreath 2013-01-08, 14:00
Hi
I've recently committed an implementation of a Hive UDAF that uses HyperLogLog for approximate distinct counting (https://github.com/MLnick/hive-udf), based on Clearspring's stream-lib library (https://github.com/clearspring/stream-lib). Perhaps it will prove useful to others.

The most interesting use case with respect to Hive is the ability to aggregate data while keeping a sketch that still yields accurate (approximate) distinct counts, say of user IDs or some similar column. This allows further aggregation with accurate distinct counts on the fly, without having to go back to the original source. For our data, this would reduce the number of rows from hundreds of millions (aggregating up to user ID) to tens of thousands.

If there is interest in including this in Hive, I could look at writing the appropriate tests for the Hive generic UDAF suite and submitting a ticket.

Thanks
Nick
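
To illustrate why stored sketches can be aggregated further without returning to the source rows, here is a minimal Python sketch of a mergeable HyperLogLog. It is a toy for illustration only, not the code in the repository or in stream-lib: the class name `HyperLogLog`, the parameter `p`, the SHA-1 based hash, and the sample user ID ranges are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog (illustrative only; not the stream-lib implementation)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from SHA-1 (an assumption made for this example).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The sketch of a union is the element-wise max of the registers.
        assert self.p == other.p, "sketches must use the same precision"
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two pre-aggregated groups (say, two days) with overlapping user IDs.
day1, day2, direct = HyperLogLog(), HyperLogLog(), HyperLogLog()
for uid in range(0, 60000):
    day1.add(uid)
    direct.add(uid)
for uid in range(40000, 100000):
    day2.add(uid)
    direct.add(uid)

# Merging the stored sketches gives the same registers as sketching the raw union.
merged = day1.merge(day2)
print(round(merged.count()), round(direct.count()))
```

Because the merge is an element-wise maximum, it is associative and commutative, so per-group sketches can be rolled up in any order and to any level while the estimate stays the same as if it had been computed over the underlying rows.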