Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Incremental Data Processing With Hive UDAF


Copy link to this message
-
Re: Incremental Data Processing With Hive UDAF
Hi All,

Greatly appreciate any feedback on this. May be this may sound infeasible.
Just wanted check with the experts on this. Anyway the problem of
incremental data processing is a very interesting one if it can be
accommodated for.

Best Regards
Buddhika

On Wed, Jan 16, 2013 at 12:36 PM, buddhika chamith
<[EMAIL PROTECTED]>wrote:

> Hi All,
>
> After digging in to the code more I realized that GroupbyOperator can be
> present at the map side of the computation as well, in which case it's
> doing partial computations. So in that case the terminate of UDAF will get
> called for partial results. However for the queries that I tried the
> terminate methods inside the UDAFs in GroupbyOperator at reduce side tree
> of the computation finishes with fully completed aggregation results as
> expected. Can be behaviour be expected in any query? (Reduce side computing
> fully aggregated result for any aggregation function)
>
> The problem I am having is that I need a point where previous aggregation
> results to be merged with the current run results. But since terminate can
> behave bit differently depending on whether it's in map side or reduce side
> would it make sense to selectively add this logic at reduce side based on
> some configuration property? (I see property mapred.task.is.map can be of
> potential use here).
>
> Also there needs to be some identifier to uniquely identify the
> aggregation UDAF in operator tree so that the previous aggregations can be
> fetched from the result cache using that identifier. Is there such
> possibility where aggregation function can be uniquely identified within
> the query?
>
> I realize this might be a long shot but I am still up for it if this is
> feasible albeit with some work. Or any other possible ways to achieve this
> is highly appreciated.
>
> Regards
> Buddhika
>
>
> On Mon, Jan 14, 2013 at 8:16 PM, buddhika chamith <[EMAIL PROTECTED]
> > wrote:
>
>> Any suggestions on this are greatly appreciated. Any one see major road
>> blocks on this?
>>
>> Regards
>> Buddhika
>>
>>
>> On Sat, Jan 12, 2013 at 10:31 AM, buddhika chamith <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi All,
>>>
>>> In order to achieve above I am researching on the feasibility of using a
>>> set of custom UADFs for distributive aggregate operations (e.g: sum, count
>>> etc..). Idea is to incorporate some state persisted from earlier
>>> aggregations to the current aggregation value inside merge of the UDAF. For
>>> distributing state data I was thinking of utilizing Hadoop distributed
>>> cache. But I am not sure about how exactly UDAF's are executed at runtime.
>>> Would including the logic to add the persisted state to the current result
>>> at terminate() ensure that it would be added only once? (Assuming all the
>>> aggregations fan in at terminate. I may gotten it all wrong here. :)). Or
>>> is there better way of achieving the same?
>>>
>>> Regards
>>> Buddhika
>>>
>>
>>
>