Re: Incremental Data Processing With Hive UDAF
buddhika chamith 2013-01-17, 17:30
I would greatly appreciate any feedback on this. Maybe this sounds infeasible;
I just wanted to check with the experts. Anyway, the problem of
incremental data processing is a very interesting one if it can be
On Wed, Jan 16, 2013 at 12:36 PM, buddhika chamith
> Hi All,
> After digging into the code more, I realized that GroupByOperator can be
> present on the map side of the computation as well, in which case it
> performs partial aggregation. So in that case the UDAF's terminate will get
> called for partial results. However, for the queries that I tried, the
> terminate methods inside the UDAFs in the GroupByOperator in the reduce-side
> tree of the computation finish with fully completed aggregation results, as
> expected. Can this behaviour be expected for any query? (That is, does the
> reduce side compute the fully aggregated result for any aggregation function?)
> The problem I am having is that I need a point where the previous aggregation
> results can be merged with the results of the current run. But since terminate
> can behave a bit differently depending on whether it runs on the map side or
> the reduce side, would it make sense to add this logic selectively on the
> reduce side, based on some configuration property? (I see that the property
> mapred.task.is.map could be of potential use here.)
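A minimal sketch of the reduce-side gating idea raised above, in plain Java with no Hive or Hadoop classes. The `Map<String, String>` stands in for the job configuration, and `ReduceSideGate`, `isReduceSide`, and `finish` are hypothetical names used only for illustration. Treating a missing property as "map side" is an assumption of the sketch, and whether this flag alone is sufficient for every query plan is exactly the open question in the message.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: fold in the previous result only when running in a reduce task. */
class ReduceSideGate {

    /** Property name taken from the message above. */
    static final String IS_MAP_PROPERTY = "mapred.task.is.map";

    /** Assumption: a missing property is treated as map side, so nothing is merged. */
    static boolean isReduceSide(Map<String, String> conf) {
        return !Boolean.parseBoolean(conf.getOrDefault(IS_MAP_PROPERTY, "true"));
    }

    /** Returns the value a terminate-style method would emit for this task. */
    static long finish(long currentSum, long previousSum, Map<String, String> conf) {
        return isReduceSide(conf) ? currentSum + previousSum : currentSum;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        conf.put(IS_MAP_PROPERTY, "true");
        System.out.println(finish(15L, 100L, conf)); // map side: prints 15

        conf.put(IS_MAP_PROPERTY, "false");
        System.out.println(finish(15L, 100L, conf)); // reduce side: prints 115
    }
}
```

The point of the gate is that a map-side partial result leaves the task unchanged, so the previous state cannot be added once per mapper.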
> Also, there needs to be some identifier that uniquely identifies the
> aggregation UDAF in the operator tree, so that the previous aggregations can
> be fetched from the result cache using that identifier. Is there any way
> an aggregation function can be uniquely identified within
> the query?
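One possible shape for such an identifier, sketched in plain Java. Everything here is an assumption made for illustration: the key components (a query name, the function name, its argument expression, and the group-by key), the `AggregationCacheSketch` class, and the in-memory map standing in for the result cache. It does not reflect anything Hive itself provides.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical result cache keyed by a composite aggregation identifier. */
class AggregationCacheSketch {

    private final Map<String, Long> previousResults = new HashMap<>();

    /**
     * Builds a key from the parts that distinguish one aggregation from another.
     * Assumption: none of the parts contains the '|' separator.
     */
    static String cacheKey(String queryName, String function, String argument, String groupKey) {
        return String.join("|", queryName, function, argument, groupKey);
    }

    void store(String key, long value) {
        previousResults.put(key, value);
    }

    /** Returns the previous result, or 0 (the identity for sum and count) if none exists. */
    long lookup(String key) {
        return previousResults.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        AggregationCacheSketch cache = new AggregationCacheSketch();
        String key = cacheKey("daily_sales", "sum", "amount", "region=EU");
        cache.store(key, 100L);
        System.out.println(cache.lookup(key)); // prints 100
        System.out.println(cache.lookup(cacheKey("daily_sales", "count", "amount", "region=EU"))); // prints 0
    }
}
```

Returning the identity value on a cache miss means the first run needs no special case, but that only works for distributive functions such as sum and count.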
> I realize this might be a long shot, but I am still up for it if it is
> feasible, albeit with some work. Any other possible ways to achieve this
> would be highly appreciated.
> On Mon, Jan 14, 2013 at 8:16 PM, buddhika chamith <[EMAIL PROTECTED]
> > wrote:
>> Any suggestions on this are greatly appreciated. Does anyone see major
>> roadblocks here?
>> On Sat, Jan 12, 2013 at 10:31 AM, buddhika chamith <
>> [EMAIL PROTECTED]> wrote:
>>> Hi All,
>>> In order to achieve the above, I am researching the feasibility of using a
>>> set of custom UDAFs for distributive aggregate operations (e.g. sum, count,
>>> etc.). The idea is to incorporate some state persisted from earlier
>>> aggregations into the current aggregation value inside the merge of the UDAF.
>>> For distributing the state data I was thinking of using the Hadoop distributed
>>> cache. But I am not sure how exactly UDAFs are executed at runtime.
>>> Would including the logic to add the persisted state to the current result
>>> in terminate() ensure that it is added only once? (Assuming all the
>>> aggregations fan in at terminate. I may have gotten it all wrong here. :)) Or
>>> is there a better way of achieving the same?
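The "added only once" concern can be illustrated with a small simulation. This is a sketch in plain Java that mimics the iterate / terminatePartial / merge / terminate lifecycle for a sum. It does not use Hive classes, and `IncrementalSumSketch`, `SumEvaluator`, and both `run` methods are hypothetical names. It assumes a single group, several map tasks, and one reducer. For reference, Hive's `GenericUDAFEvaluator` distinguishes the modes PARTIAL1, PARTIAL2, FINAL, and COMPLETE, which is roughly the distinction the sketch models by hand.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for a distributive (sum) UDAF; not the Hive API. */
class IncrementalSumSketch {

    /** Hypothetical evaluator: one instance per task, for a single group. */
    static final class SumEvaluator {
        private long sum = 0L;

        void iterate(long value) { sum += value; }    // map side: consume raw rows
        long terminatePartial() { return sum; }       // map side: emit a partial sum
        void merge(long partial) { sum += partial; }  // reduce side: combine partials

        /** Final result: the persisted state is folded in exactly once, here. */
        long terminate(long previousState) { return sum + previousState; }
    }

    /** Adds the previous state only in the final terminate of the single reducer. */
    static long run(List<long[]> mapperInputs, long previousState) {
        SumEvaluator reducer = new SumEvaluator();
        for (long[] rows : mapperInputs) {
            SumEvaluator mapper = new SumEvaluator();
            for (long row : rows) {
                mapper.iterate(row);
            }
            reducer.merge(mapper.terminatePartial());
        }
        return reducer.terminate(previousState);
    }

    /** Counter-example: adding the state to every partial counts it once per mapper. */
    static long runAddingStateInPartial(List<long[]> mapperInputs, long previousState) {
        SumEvaluator reducer = new SumEvaluator();
        for (long[] rows : mapperInputs) {
            SumEvaluator mapper = new SumEvaluator();
            for (long row : rows) {
                mapper.iterate(row);
            }
            reducer.merge(mapper.terminatePartial() + previousState);
        }
        return reducer.terminate(0L);
    }

    public static void main(String[] args) {
        List<long[]> currentRun = Arrays.asList(new long[] {1, 2, 3}, new long[] {4, 5});
        System.out.println(run(currentRun, 100L));                     // prints 115
        System.out.println(runAddingStateInPartial(currentRun, 100L)); // prints 215
    }
}
```

With two map tasks the counter-example counts the previous state twice, so the merge point has to be the one place where all partials for a group have already fanned in.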