To achieve the above, I am researching the feasibility of using a set of
custom UDAFs for distributive aggregate operations (e.g., sum, count,
etc.). The idea is to fold state persisted from earlier aggregations into
the current aggregate value inside the UDAF's merge(). For distributing
the state data I was thinking of using the Hadoop distributed cache. But I
am not sure exactly how UDAFs are executed at runtime. Would putting the
logic that adds the persisted state to the current result in terminate()
ensure that it is added only once? (I am assuming all the partial
aggregations fan in at terminate(). I may have gotten this all wrong. :))
Or is there a better way of achieving the same?
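To make the question concrete, here is a simplified, Hive-free sketch of the lifecycle I have in mind (a real evaluator would extend Hive's GenericUDAFEvaluator; SumWithStateSketch and externalState are hypothetical names, with externalState standing in for the value loaded from the distributed cache):

```java
public class SumWithStateSketch {
    long partial;              // running partial aggregate
    final long externalState;  // persisted state that every instance can see

    SumWithStateSketch(long externalState) {
        this.externalState = externalState;
    }

    // iterate(): called once per input row on the map side
    void iterate(long value) {
        partial += value;
    }

    // merge(): called once per incoming partial on the reduce side, so
    // folding externalState in here would add it once per mapper
    void merge(long otherPartial) {
        partial += otherPartial;
    }

    // terminate(): called once per group by the final-stage evaluator,
    // so the state is folded in exactly once here
    long terminate() {
        return partial + externalState;
    }

    public static void main(String[] args) {
        // Two "map-side" evaluators; both load the same state (100)
        // but neither ever calls terminate()
        SumWithStateSketch m1 = new SumWithStateSketch(100);
        m1.iterate(1);
        m1.iterate(2);
        SumWithStateSketch m2 = new SumWithStateSketch(100);
        m2.iterate(3);

        // Final "reduce-side" evaluator merges the partials, then
        // terminates exactly once for the group
        SumWithStateSketch reducer = new SumWithStateSketch(100);
        reducer.merge(m1.partial);
        reducer.merge(m2.partial);
        System.out.println(reducer.terminate()); // 1 + 2 + 3 + 100 = 106
    }
}
```

The question, restated against this sketch: does Hive guarantee that terminate() runs only once per group on the final evaluator, so that adding externalState there cannot double-count it the way adding it in merge() would?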