HyperLogLog Approximate Distinct Counting as a Hive UDAF
Nick Pentreath 2013-01-08, 14:00
I've recently committed an implementation of a Hive UDAF that uses
HyperLogLog for approximate distinct counting (
https://github.com/MLnick/hive-udf), based on Clearspring's stream-lib.
Perhaps it will prove useful to others. The most interesting use case with
respect to Hive is the ability to aggregate data while keeping a sketch of
distinct counts (say, of user IDs or some similar column). Because sketches
can be merged, this allows further aggregation with accurate approximate
distinct counts on the fly, without having to go back to the original source.
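
To illustrate why a stored sketch allows further aggregation, here is a
minimal, self-contained HyperLogLog sketch in Python. It is not the UDAF's
code and not stream-lib's API: the class, its method names, the SHA-1 based
hash and the precision p=12 are all assumptions made for this example. The
point it tries to show is that merging two sketches (taking the per-register
maximum) gives the same registers as sketching the combined data directly.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (hypothetical class)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is an element-wise max, so sketches can be combined later
        # without revisiting the rows that produced them.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)      # small-range (linear counting) correction
        return raw


# Two pre-aggregated groups with overlapping (made-up) user IDs.
day1, day2 = HyperLogLog(), HyperLogLog()
for uid in range(0, 60000):
    day1.add(uid)
for uid in range(40000, 100000):
    day2.add(uid)

combined = day1.merge(day2)
print(round(combined.estimate()))  # approximate count of the 100000 distinct IDs
```

In Hive terms, each pre-aggregated row would carry something like the
register array above (serialized), and a later rollup would merge those
arrays rather than re-scan the raw user IDs.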
In the case of our data, this would reduce the number of rows from hundreds
of millions (aggregating up to user ID) down to tens of thousands.
If there is interest in including this in Hive, I could look at writing the
appropriate tests for the Hive generic UDAF suite and submitting a ticket.