|
karanveer.singh@...
2012-04-10, 13:44
Philip Tromans
2012-04-10, 14:17
karanveer.singh@...
2012-04-10, 14:37
David Kulp
2012-04-10, 14:45
karanveer.singh@...
2012-04-10, 14:51
David Kulp
2012-04-10, 14:56
Hamilton, Robert
2012-04-10, 15:01
Philip Tromans
2012-04-10, 15:02
David Kulp
2012-04-10, 15:07
Butani, Harish
2012-04-10, 15:10
karanveer.singh@...
2012-04-11, 05:43
Nitin Pawar
2012-04-11, 06:44
karanveer.singh@...
2012-04-11, 08:15
karanveer.singh@...
2012-04-11, 08:23
Mark Grover
2012-04-11, 13:31
Ashutosh Chauhan
2012-04-11, 14:54
Butani, Harish
2012-04-11, 21:39
|
-
Lag function in Hive
karanveer.singh@... 2012-04-10, 13:44
Hi,

Is there something like a 'lag' function in Hive? The requirement is to calculate difference for the same column for every 2 subsequent records.

For example:

Row, Column A, Column B
1, 10, 100
2, 20, 200
3, 30, 300

The result that I need should be like:

Row, Column A, Column B, Result
1, 10, 100, NULL
2, 20, 200, 100 (200-100)
3, 30, 300, 100 (300-200)

Rgds,
Karan

This e-mail and any attachments are confidential and intended solely for the addressee and may also be privileged or exempt from disclosure under applicable law. If you are not the addressee, or have received this e-mail in error, please notify the sender immediately, delete it from your system and do not copy, disclose or otherwise act upon any part of this e-mail or its attachments.

Internet communications are not guaranteed to be secure or virus-free. The Barclays Group does not accept responsibility for any loss arising from unauthorised access to, or interference with, any Internet communications by any third party, or from the transmission of any viruses. Replies to this e-mail may be monitored by the Barclays Group for operational or business reasons.

Any opinion or other information in this e-mail or its attachments that does not relate to the business of the Barclays Group is personal to the sender and is not given or endorsed by the Barclays Group.

Barclays Bank PLC. Registered in England and Wales (registered no. 1026167). Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.

Barclays Bank PLC is authorised and regulated by the Financial Services Authority.
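For reference, the calculation asked for in the question can be sketched outside Hive as a single ordered pass. This is only an illustration in plain Java; the class and method names (LagDiffSketch, lagDiff) are made up for this example and are not part of Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch (hypothetical names): "lag by one row" difference over
// rows that are already in the desired order.
class LagDiffSketch {

    // Returns value[i] - value[i-1] for each row. The first row has no
    // predecessor, so its result is null (standing in for SQL NULL).
    static List<Integer> lagDiff(List<Integer> values) {
        List<Integer> result = new ArrayList<>();
        Integer previous = null;
        for (Integer current : values) {
            if (previous == null) {
                result.add(null);
            } else {
                result.add(current - previous);
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        // Column B from the example in the question.
        List<Integer> columnB = Arrays.asList(100, 200, 300);
        System.out.println(lagDiff(columnB)); // prints [null, 100, 100]
    }
}
```

The whole difficulty discussed in the rest of the thread is that a distributed engine does not naturally give you this "one ordered pass" view of the data.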
-
Re: Lag function in Hive
Philip Tromans 2012-04-10, 14:17
Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard to parallelise in a map-reduce way (it requires knowing where you are in a result set and who your neighbours are, and they in turn need to be present on the same node as you, which is difficult to guarantee).

Cheers,

Phil.
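The point about neighbours needing to be on the same node can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not how Hive or MapReduce actually split data: it just cuts an ordered list into fixed-size chunks and lets each chunk compute its lag independently. All names (SplitLagSketch, lagDiffPerChunk) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration (hypothetical names): a lag computed per chunk loses
// the neighbour at every chunk boundary.
class SplitLagSketch {

    // Sequential lag difference over one ordered list.
    static List<Integer> lagDiff(List<Integer> values) {
        List<Integer> result = new ArrayList<>();
        Integer previous = null;
        for (Integer current : values) {
            if (previous == null) {
                result.add(null);
            } else {
                result.add(current - previous);
            }
            previous = current;
        }
        return result;
    }

    // Emulates independent workers: each chunk is processed with no
    // knowledge of rows held by any other chunk. chunkSize must be > 0.
    static List<Integer> lagDiffPerChunk(List<Integer> values, int chunkSize) {
        List<Integer> result = new ArrayList<>();
        for (int start = 0; start < values.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, values.size());
            result.addAll(lagDiff(values.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(100, 200, 300, 400);
        System.out.println(lagDiff(values));            // [null, 100, 100, 100]
        System.out.println(lagDiffPerChunk(values, 2)); // [null, 100, null, 100]
    }
}
```

The third row gets null in the chunked run because its predecessor lives in a different chunk, which is the guarantee problem described above.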
-
RE: Lag function in Hive
karanveer.singh@... 2012-04-10, 14:37
Makes sense, but isn't the data distributed across nodes as chunks of records in that order?

If Hive cannot help me do this, is there another way I can do this? I tried generating an identifier using a Perl script invoked from Hive, but it does not seem to work correctly. The standalone script works fine, but when the records are created in Hive from the Perl script's standard output, I see 2 records for some of the unique identifiers. I explored the possibility of default data type changes, but that does not solve the problem.

Regards,
Karan
-
Re: Lag function in Hive
David Kulp 2012-04-10, 14:45
New here. Hello all.

Could you try a self-join, possibly also restricted to partitions?

E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar

If your data is clustered by rownum, then this join should, in theory, be relatively fast -- especially if it makes sense to exploit partitions.

-d
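The self-join idea can be emulated in plain Java to see what it produces. This is a sketch under assumptions: rows are reduced to a (rownum, value) pair, the partition filters are left out, and the names (SelfJoinLagSketch, selfJoinDiff) are invented for the example. Note that with the condition t1.rownum = t2.rownum + 1, t1 is the later row, so the expression t2.value - t1.value as posted yields previous minus current; the sketch uses t1.value - t2.value so that it matches the 200-100 example in the original question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch (hypothetical names): emulate an inner self-join on
// t1.rownum = t2.rownum + 1 and emit t1.value - t2.value.
class SelfJoinLagSketch {

    // Input: rownum -> value. Output: rownum -> (value - previous value).
    // Like an inner join, rows without a predecessor produce no output row.
    static Map<Integer, Integer> selfJoinDiff(Map<Integer, Integer> valueByRownum) {
        Map<Integer, Integer> diffByRownum = new TreeMap<>();
        for (Map.Entry<Integer, Integer> t1 : valueByRownum.entrySet()) {
            // Probe for the matching t2 row: t2.rownum = t1.rownum - 1.
            Integer t2Value = valueByRownum.get(t1.getKey() - 1);
            if (t2Value != null) {
                diffByRownum.put(t1.getKey(), t1.getValue() - t2Value);
            }
        }
        return diffByRownum;
    }

    public static void main(String[] args) {
        // Inserted out of order on purpose: the join keys on rownum,
        // not on the physical order of the rows.
        Map<Integer, Integer> table = new HashMap<>();
        table.put(3, 300);
        table.put(1, 100);
        table.put(2, 200);
        System.out.println(selfJoinDiff(table)); // prints {2=100, 3=100}
    }
}
```

Row 1 is missing from the output because an inner join drops rows with no match; getting the NULL shown in the question for the first row would need an outer join instead.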
-
Re: Lag function in Hive
karanveer.singh@... 2012-04-10, 14:51
Thanks - I will check this out.

Meanwhile, would clustering by rownum happen by default? How can I check how clustering is set up in our environment?

Rgds
-
Re: Lag function in Hive
David Kulp 2012-04-10, 14:56
You have to explicitly request it in CREATE TABLE. And you should generally let Hive perform the clustering -- i.e. don't use an external table with data that is generated by some other process because it's hard to get the hash and notation right.

Check your table with "DESCRIBE FORMATTED tablename".
-
RE: Lag function in Hive
Hamilton, Robert 2012-04-10, 15:01
You can write a custom UDF -

Here is one that I have played around with, along with some test SQL. It comes with no warranty :) Sorry I can't really share the test data, but hopefully you get the idea.

To run, compile the Lag class, jar it up into Analytics.jar, put the jar on the CLASSPATH (you may need to deploy to all the nodes on the cluster) and run the hive command below. Note the "distribute by" and "sort by" are critical. Also the sub-select is just an artifice to make sure the UDF is running in the reducer (so that it is sorted). Maybe the Hive experts can suggest a better way for that to work...

#
# use live clickstream test data from 2012-01-12
#
hive -e "add jar Analytics.jar;
create temporary function lag as 'com.example.hive.udf.Lag';
select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id)
from (select session_id,hit_datetime_gmt
      from omni2
      where visit_day='2012-01-12' and session_id is not null
      distribute by session_id
      sort by session_id,hit_datetime_gmt
     ) X
distribute by session_id
limit 1000
"

------------------------ Contents of Lag.java -----------------------------------------
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Lag extends UDF {
    private int counter;
    private String last_key;      // value seen on the previous call
    private String lastGroup;     // group key seen on the previous call
    private String return_value = "";

    // Returns the previous row's key if it belonged to the same group,
    // otherwise NULL. Relies on rows arriving sorted within each group.
    public String evaluate(String key, String groupKey) {
        if (groupKey == null) {
            this.last_key = null;
        } else if (!groupKey.equalsIgnoreCase(this.lastGroup)) {
            // New group: forget the value carried over from the old one.
            this.last_key = null;
        }
        return_value = this.last_key;
        this.last_key = key;
        this.lastGroup = groupKey;
        return return_value;
    }
}

Result of test run:

1326326437-26270601625187049522752846106448274394 2012-01-12 00:00:37 NULL
1326326437-26270601625187049522752846106448274394 2012-01-12 00:00:59 2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:05 2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:07 2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:11 2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:12 2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:24 2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:32 2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:45 2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:48 2012-01-12 00:01:45
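To see why the "distribute by" and "sort by" matter for a stateful UDF like this, the same remember-the-previous-row logic can be run outside Hive. The class below is a stand-in written for this illustration: it does not extend Hive's UDF class, and the names (StatefulLagSketch, run) and the shortened sample session ids and times are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch (hypothetical names, no Hive dependency): a stateful
// lag that returns the previous value within the same group.
class StatefulLagSketch {
    private String lastValue;
    private String lastGroup;

    // Returns the value from the previous call if that call had the same
    // group key, otherwise null.
    String evaluate(String value, String groupKey) {
        if (groupKey == null || !groupKey.equalsIgnoreCase(lastGroup)) {
            lastValue = null; // group changed: drop the carried-over value
        }
        String previous = lastValue;
        lastValue = value;
        lastGroup = groupKey;
        return previous;
    }

    // Feeds rows of {groupKey, value} through one instance, in the given order.
    static List<String> run(String[][] rows) {
        StatefulLagSketch lag = new StatefulLagSketch();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(lag.evaluate(row[1], row[0]));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] sorted = {
            {"s1", "00:00:37"}, {"s1", "00:00:59"},
            {"s2", "00:01:05"}, {"s2", "00:01:07"}};
        String[][] interleaved = {
            {"s1", "00:00:37"}, {"s2", "00:01:05"},
            {"s1", "00:00:59"}, {"s2", "00:01:07"}};
        System.out.println(run(sorted));      // [null, 00:00:37, null, 00:01:05]
        System.out.println(run(interleaved)); // [null, null, null, null]
    }
}
```

When rows of the same group are not adjacent and ordered, the state is reset on every call and the lag is lost, which is consistent with the note above that the distribution and sort order are critical.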
-
Re: Lag function in Hive
Philip Tromans 2012-04-10, 15:02
I think you want something more like:

SELECT t2.value - t1.value
FROM mytable t1
JOIN mytable t2 ON (t1.rownum = t2.rownum + 1 AND t2.partition=bar)
WHERE t1.partition=foo;

This should be faster as partition selection will happen earlier.

This is still going to involve an awful lot of I/O, and not going to be fast.

Phil.
-
Re: Lag function in Hive
David Kulp 2012-04-10, 15:07
Yeah. I don't think my SQL would even be accepted because Hive QL doesn't allow the alternate join syntax in the WHERE clause. Thanks Phil.
-
RE: Lag function in Hive
Butani, Harish 2012-04-10, 15:10

Hi Karan,

SQL Windowing with Hive (https://github.com/hbutani/SQLWindowing/wiki) may be a good fit for your use case. We have a lag function, and you can say something like:

From table
Partition by col1, col2...
Order by col1, col2,...
Select colX, <colX - lag(colX, 1)>

(There is a lag example on the wiki, and other time-series examples based on the NPath table function.)

You control the partitioning with the partition and order clauses. Partitions can be arbitrarily large (so you could partition by a dummy column and have all rows in one partition), but this works best when there are natural partitions in your data and you do not need to calculate across partitions.

Regards,
Harish.
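As a rough illustration of the partition-by / order-by / lag pattern described in the message above, here is a minimal plain-Java sketch. It is not the SQLWindowing implementation and does not use any Hive API; the class `LagSketch`, the `Row` type, and the method `lagDiffs` are hypothetical names made up for this example. It only shows the semantics: rows are grouped by a partition key, sorted within each partition, and each row's value is differenced against the previous row of the same partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Hive or SQLWindowing.
final class LagSketch {

    // One input row: a partition key, an ordering key, and the value to difference.
    static final class Row {
        final String partition;
        final int order;
        final long value;

        Row(String partition, int order, long value) {
            this.partition = partition;
            this.order = order;
            this.value = value;
        }
    }

    // Returns value - lag(value, 1) for every row, partition by partition.
    // The first row of each partition has no predecessor, so its entry is null.
    static List<Long> lagDiffs(List<Row> rows) {
        // "Partition by": group rows by key, keeping partitions in first-seen order.
        Map<String, List<Row>> partitions = new LinkedHashMap<>();
        for (Row r : rows) {
            partitions.computeIfAbsent(r.partition, k -> new ArrayList<>()).add(r);
        }

        List<Long> diffs = new ArrayList<>();
        for (List<Row> part : partitions.values()) {
            // "Order by": sort inside the partition only.
            part.sort(Comparator.comparingInt((Row r) -> r.order));
            Row previous = null; // the lag is reset at every partition boundary
            for (Row r : part) {
                if (previous == null) {
                    diffs.add(null);
                } else {
                    diffs.add(r.value - previous.value);
                }
                previous = r;
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // The example rows from the original question, all in one dummy partition.
        List<Row> rows = Arrays.asList(
            new Row("all", 1, 100), new Row("all", 2, 200), new Row("all", 3, 300));
        System.out.println(lagDiffs(rows)); // prints [null, 100, 100]
    }
}
```

Because the lag is reset at each partition boundary, no difference is ever computed across partitions, which matches the caveat in the message about not calculating across partitions.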
-
RE: Lag function in Hive
karanveer.singh@... 2012-04-11, 05:43

When I try using rownum in my Hive QL query, I get: "Invalid column reference rownum". Am I missing something here?

Regards,
Karan
-
Re: Lag function in Hive
Nitin Pawar 2012-04-11, 06:44

Does your table have a column called "rownum"?

I think that in Philip's mail it was just an example.

Nitin Pawar
-
RE: Lag function in Hive
karanveer.singh@... 2012-04-11, 08:15

Rob and all,

I tried the UDF you posted and created the jar file. To add the jar to the class path, I do the following:

hive> add jar /users/unix/singhka/Analytics.jar;

The above seems to have worked fine, as I see the resource added, but when I go ahead and create the function, I get the following error. Any ideas what the issue can be?

hive> create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask

Regards,
-
RE: Lag function in Hive
karanveer.singh@... 2012-04-11, 08:23

That's the whole problem, right? I am unable to create a unique column for my record rows within Hive. If that were there, I could get the lag functionality to work for me.

I was hoping that ROWNUM would act like a pseudo column in Hive.

Regards,
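The thread does not establish why the generated identifiers came out duplicated. One possible, unconfirmed explanation is that each parallel task runs its own copy of the numbering script and starts counting from the same value. The plain-Java sketch below only illustrates that effect and one common workaround, which is to combine a task index with the per-task counter. Everything in it is an assumption for illustration: the class `RowIdSketch`, the notion of a numbered task index, and the 40-bit counter layout are hypothetical and are not something Hive provides.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the task numbering and ID layout are assumptions.
final class RowIdSketch {

    // Bits reserved for the per-task counter (an arbitrary choice for this sketch).
    static final int COUNTER_BITS = 40;

    // Naive scheme: every task counts from 1, so IDs repeat across tasks.
    static List<Long> naiveIds(int rowsInTask) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= rowsInTask; i++) {
            ids.add(i);
        }
        return ids;
    }

    // Composite scheme: task index in the high bits, local counter in the low bits.
    static List<Long> compositeIds(int taskIndex, int rowsInTask) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= rowsInTask; i++) {
            ids.add(((long) taskIndex << COUNTER_BITS) | i);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two simulated tasks, three rows each.
        Set<Long> naive = new HashSet<>();
        naive.addAll(naiveIds(3));
        naive.addAll(naiveIds(3));

        Set<Long> composite = new HashSet<>();
        composite.addAll(compositeIds(0, 3));
        composite.addAll(compositeIds(1, 3));

        System.out.println("naive distinct IDs: " + naive.size());
        System.out.println("composite distinct IDs: " + composite.size());
    }
}
```

Note that composite IDs of this kind are unique but not consecutive across tasks, so a self-join on rownum + 1 would still only pair up rows numbered by the same task.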
-
Re: Lag function in Hive
Mark Grover 2012-04-11, 13:31

Hi Karan,

The error you get when creating the temporary function typically happens when there is a typo in the class name (com.example.hive.udf.Lag, in this case). Can you ensure that the jar was properly built and contains the Lag class in the com.example.hive.udf package?

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
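One way to act on the suggestion above is to check whether the jar really contains an entry for the class. Running `jar tf Analytics.jar` lists the entries on the command line; the small Java sketch below does the same check programmatically using the standard `java.util.jar.JarFile` class. The class name `JarClassCheck` and its methods are hypothetical helpers written for this example, and a missing entry is only one possible cause of the error, so a positive result does not guarantee that the function will register.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for illustration; not part of Hive.
final class JarClassCheck {

    // Converts a fully qualified class name to its entry path inside a jar,
    // e.g. com.example.hive.udf.Lag -> com/example/hive/udf/Lag.class
    static String entryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar at jarPath has an entry for the given class.
    static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the names used in the thread; pass your own as arguments.
        String jarPath = args.length > 0 ? args[0] : "Analytics.jar";
        String className = args.length > 1 ? args[1] : "com.example.hive.udf.Lag";

        if (!new File(jarPath).isFile()) {
            System.out.println("No jar found at " + jarPath);
            return;
        }
        System.out.println(className + " present in " + jarPath + ": "
            + containsClass(jarPath, className));
    }
}
```

If the entry turns up under a different path, for example at the top level of the jar, the class was probably compiled or packaged without its package directory structure.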
-
Re: Lag function in Hive
Ashutosh Chauhan 2012-04-11, 14:54

Hey Harish,

Awesome work on SQL Windowing. Judging from the participation on this thread, it seems windowing is of sizable interest to the Hive community. Would you consider contributing your work upstream to Hive? If it's in Hive contrib, it will be accessible to a lot of folks using Hive out of the box.

Thanks,
Ashutosh
-
RE: Lag function in Hive
Butani, Harish 2012-04-11, 21:39

Hi Ashutosh,

Thanks for taking a look. Yes, I am definitely open to contributing back to Hive. I had a conversation with Carl Steinbach last week about this. I will send you a follow-up message.

Regards,
Harish
|