Hive, mail # user - Lag function in Hive


Re: Lag function in Hive
Ashutosh Chauhan 2012-04-11, 14:54
Hey Harish,

Awesome work on SQL Windowing. Judging from participation on this thread,
it seems windowing is of sizable interest to the Hive community. Would you
consider contributing your work upstream to Hive? If it's in Hive contrib,
it will be accessible to a lot of folks using Hive out of the box.

Thanks,
Ashutosh

On Tue, Apr 10, 2012 at 08:10, Butani, Harish <[EMAIL PROTECTED]> wrote:

> Hi Karan,
>
> SQL Windowing with Hive (https://github.com/hbutani/SQLWindowing/wiki)
> may be a good fit for your use case.
>
> We have a lag function and you can say something like
>
> From table
> Partition by col1, col2...
> Order by col1, col2,...
> Select colX, <colX - lag(colX, 1)>
>
> (there is a lag example on the wiki, and other time-series examples based on
> the NPath table function)
>
> You can control the partitioning by the partitioning and order clauses.
> Partitions could be arbitrarily large (so you could partition by a dummy
> column and have all rows in one partition), but it works best when there are
> natural partitions in your data and you do not need to calculate across
> partitions.
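>
> As a rough illustration of the template above (the table and column names
> here are made up, not from the wiki):
>
>   from stock_prices
>   partition by ticker
>   order by ticker, trade_date
>   select ticker, trade_date, close_price,
>     <close_price - lag(close_price, 1)>
>
> This is only a sketch of the query shape; see the wiki for the exact syntax
> SQLWindowing accepts.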
>
>
> Regards,
> Harish.
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 10, 2012 7:52 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Lag function in Hive
>
> Thanks - I will check this out.
>
>  Meanwhile, would default clustering happen using rownum? How can I check
> how clustering is happening in our environment?
>
> Rgds
>
> ----- Original Message -----
> From: David Kulp <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
> Sent: Tue Apr 10 15:45:25 2012
> Subject: Re: Lag function in Hive
>
> New here.  Hello all.
>
> Could you try a self-join, possibly also restricted to partitions?
>
> E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE
> t1.rownum = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar
>
> If your data is clustered by rownum, then this join should, in theory, be
> relatively fast -- especially if it makes sense to exploit partitions.
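>
> As a rough sketch of the same idea written as an explicit equi-join (table
> and column names follow the query above; untested):
>
> SELECT cur.rownum, cur.value, cur.value - prev.value AS diff
> FROM mytable cur
> LEFT OUTER JOIN
>   (SELECT rownum + 1 AS next_rownum, value FROM mytable) prev
> ON (cur.rownum = prev.next_rownum);
>
> The subquery turns the off-by-one condition into a plain equality, which
> Hive joins on natively, and the LEFT OUTER JOIN keeps the first row with a
> NULL difference.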
>
> -d
>
> On Apr 10, 2012, at 10:37 AM, <[EMAIL PROTECTED]> <
> [EMAIL PROTECTED]> wrote:
>
> > Makes sense, but isn't the distribution across nodes done for a chunk of
> records in that order?
> >
> > If Hive cannot help me do this, is there another way I can do this? I
> tried generating an identifier using a Perl script invoked from Hive, but
> it does not seem to work correctly. While the standalone script works fine,
> when the records are created in Hive from the Perl script's standard output,
> I see 2 records for some of the unique identifiers. I explored the
> possibility of default data type changes, but that does not solve the problem.
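> >
> > (For reference, such a script is usually wired into Hive with TRANSFORM; a
> > minimal sketch, with a hypothetical script name and columns:
> >
> >   ADD FILE add_rownum.pl;
> >   SELECT TRANSFORM (colA, colB)
> >     USING 'perl add_rownum.pl'
> >     AS (rownum, colA, colB)
> >   FROM mytable;
> >
> > Note that each map task runs its own copy of the script, so a counter kept
> > inside the script restarts per task and can yield duplicate identifiers.)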
> >
> > Regards,
> > Karan
> >
> >
> > -----Original Message-----
> > From: Philip Tromans [mailto:[EMAIL PROTECTED]]
> > Sent: 10 April 2012 19:48
> > To: [EMAIL PROTECTED]
> > Subject: Re: Lag function in Hive
> >
> > Hi Karan,
> >
> > To the best of my knowledge, there isn't one. It's also unlikely to
> > happen because it's hard to parallelise in a map-reduce way (it
> > requires knowing where you are in a result set and who your
> > neighbours are, and those neighbours in turn need to be present on the
> > same node as you, which is difficult to guarantee).
> >
> > Cheers,
> >
> > Phil.
> >
> > On 10 April 2012 14:44,  <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> Is there something like a 'lag' function in Hive? The requirement is to
> >> calculate the difference in the same column between every two consecutive records.
> >>
> >> For example.
> >>
> >> Row, Column A, Column B
> >> 1, 10, 100
> >> 2, 20, 200
> >> 3, 30, 300
> >>
> >>
> >> The result that I need should be like:
> >>
> >> Row, Column A, Column B, Result
> >> 1, 10, 100, NULL
> >> 2, 20, 200, 100 (200-100)
> >> 3, 30, 300, 100 (300-200)
> >>
> >> Rgds,
> >> Karan
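> >>
> >> (For illustration only: the computation above is what a standard SQL lag
> >> function expresses. In window-function syntax, with hypothetical names in
> >> place of Row, Column A, and Column B, the result column would be
> >>
> >>   SELECT row_id, colA, colB,
> >>          colB - LAG(colB, 1) OVER (ORDER BY row_id) AS result
> >>   FROM mytable;
> >>
> >> where LAG(colB, 1) returns the previous row's colB, or NULL for the first
> >> row. Hive did not support this syntax at the time of this thread.)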
> >>
> >>
> >>
> >>
> >>