Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Incremental Data Processing With Hive UDAF

Copy link to this message
Re: Incremental Data Processing With Hive UDAF
buddhika chamith 2013-01-16, 07:06
Hi All,

After digging in to the code more I realized that GroupbyOperator can be
present at the map side of the computation as well, in which case it's
doing partial computations. So in that case the terminate of UDAF will get
called for partial results. However for the queries that I tried the
terminate methods inside the UDAFs in GroupbyOperator at reduce side tree
of the computation finishes with fully completed aggregation results as
expected. Can be behaviour be expected in any query? (Reduce side computing
fully aggregated result for any aggregation function)

The problem I am having is that I need a point where previous aggregation
results to be merged with the current run results. But since terminate can
behave bit differently depending on whether it's in map side or reduce side
would it make sense to selectively add this logic at reduce side based on
some configuration property? (I see property mapred.task.is.map can be of
potential use here).

Also there needs to be some identifier to uniquely identify the aggregation
UDAF in operator tree so that the previous aggregations can be fetched from
the result cache using that identifier. Is there such possibility where
aggregation function can be uniquely identified within the query?

I realize this might be a long shot but I am still up for it if this is
feasible albeit with some work. Or any other possible ways to achieve this
is highly appreciated.


On Mon, Jan 14, 2013 at 8:16 PM, buddhika chamith

> Any suggestions on this are greatly appreciated. Any one see major road
> blocks on this?
> Regards
> Buddhika
> On Sat, Jan 12, 2013 at 10:31 AM, buddhika chamith <
>> Hi All,
>> In order to achieve above I am researching on the feasibility of using a
>> set of custom UADFs for distributive aggregate operations (e.g: sum, count
>> etc..). Idea is to incorporate some state persisted from earlier
>> aggregations to the current aggregation value inside merge of the UDAF. For
>> distributing state data I was thinking of utilizing Hadoop distributed
>> cache. But I am not sure about how exactly UDAF's are executed at runtime.
>> Would including the logic to add the persisted state to the current result
>> at terminate() ensure that it would be added only once? (Assuming all the
>> aggregations fan in at terminate. I may gotten it all wrong here. :)). Or
>> is there better way of achieving the same?
>> Regards
>> Buddhika