|
Rakesh Setty
2009-06-29, 23:01
Amr Awadallah
2009-06-29, 23:14
Peter Skomoroch
2009-06-29, 23:19
Rakesh Setty
2009-06-29, 23:42
Namit Jain
2009-06-30, 00:04
Rakesh Setty
2009-06-30, 00:11
Namit Jain
2009-06-30, 00:19
Raghu Murthy
2009-06-30, 00:21
Rakesh Setty
2009-06-30, 00:28
|
-
Set difference in HiveRakesh Setty 2009-06-29, 23:01
Hi,
I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive? select user from page_views where user not in (select name from users); Thanks, Rakesh
-
Re: Set difference in HiveAmr Awadallah 2009-06-29, 23:14
do an outer join on user and filter on name.user is null -- amr Rakesh Setty wrote: > > Hi, > > > > I am new to Hive. I would like to know what is the easiest > way to get the difference between two sets. For example, how can I > convert the following SQL query to Hive? > > > > select user from page_views where user not in (select name from users); > > > > Thanks, > > Rakesh >
-
Re: Set difference in HivePeter Skomoroch 2009-06-29, 23:19
Here is an example of what Amr mentioned from one of my Hive scripts,
returns the set of pages not in "daily_pagecounts_table" select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id) where dp.page_id is NULL On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]> wrote: > > do an outer join on user and filter on name.user is null > > -- amr > > > Rakesh Setty wrote: > > Hi, > > > > I am new to Hive. I would like to know what is the easiest way > to get the difference between two sets. For example, how can I convert the > following SQL query to Hive? > > > > select user from page_views where user not in (select name from users); > > > > Thanks, > > Rakesh > > -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
-
RE: Set difference in HiveRakesh Setty 2009-06-29, 23:42
Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key []
Both the tables are large and as Zheng mentions at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this? Thanks, Rakesh ________________________________ From: Peter Skomoroch [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:20 PM To: [EMAIL PROTECTED] Subject: Re: Set difference in Hive Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table" select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id) where dp.page_id is NULL On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: do an outer join on user and filter on name.user is null -- amr Rakesh Setty wrote: Hi, I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive? select user from page_views where user not in (select name from users); Thanks, Rakesh -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
-
RE: Set difference in HiveNamit Jain 2009-06-30, 00:04
The tables can be large -
For a given key, have the table with the most number of values as the rightmost table. The problem only happens when both the tables have keys with large number of values. Thanks, -namit From: Rakesh Setty [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:43 PM To: [EMAIL PROTECTED] Subject: RE: Set difference in Hive Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key [] Both the tables are large and as Zheng mentions at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this? Thanks, Rakesh ________________________________ From: Peter Skomoroch [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:20 PM To: [EMAIL PROTECTED] Subject: Re: Set difference in Hive Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table" select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id) where dp.page_id is NULL On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: do an outer join on user and filter on name.user is null -- amr Rakesh Setty wrote: Hi, I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive? select user from page_views where user not in (select name from users); Thanks, Rakesh -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
-
RE: Set difference in HiveRakesh Setty 2009-06-30, 00:11
I tried this. Unfortunately, both tables are large.
Thanks, Rakesh ________________________________ From: Namit Jain [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 5:05 PM To: [EMAIL PROTECTED] Subject: RE: Set difference in Hive The tables can be large - For a given key, have the table with the most number of values as the rightmost table. The problem only happens when both the tables have keys with large number of values. Thanks, -namit From: Rakesh Setty [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:43 PM To: [EMAIL PROTECTED] Subject: RE: Set difference in Hive Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key [] Both the tables are large and as Zheng mentions at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this? Thanks, Rakesh ________________________________ From: Peter Skomoroch [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:20 PM To: [EMAIL PROTECTED] Subject: Re: Set difference in Hive Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table" select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id) where dp.page_id is NULL On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: do an outer join on user and filter on name.user is null -- amr Rakesh Setty wrote: Hi, I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive? select user from page_views where user not in (select name from users); Thanks, Rakesh -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
-
RE: Set difference in HiveNamit Jain 2009-06-30, 00:19
It is independent of the size of the table. The cardinality of values for a given key that matters.
Is that also pretty large for both tables ? From: Rakesh Setty [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 5:11 PM To: [EMAIL PROTECTED] Subject: RE: Set difference in Hive I tried this. Unfortunately, both tables are large. Thanks, Rakesh ________________________________ From: Namit Jain [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 5:05 PM To: [EMAIL PROTECTED] Subject: RE: Set difference in Hive The tables can be large - For a given key, have the table with the most number of values as the rightmost table. The problem only happens when both the tables have keys with large number of values. Thanks, -namit From: Rakesh Setty [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:43 PM To: [EMAIL PROTECTED] Subject: RE: Set difference in Hive Thanks very much. But the reducer hangs with the warning WARN org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than joinEmitInterval rows for join key [] Both the tables are large and as Zheng mentions at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00640.html, large size for table 0 is a problem. Is there any way to overcome this? Thanks, Rakesh ________________________________ From: Peter Skomoroch [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 4:20 PM To: [EMAIL PROTECTED] Subject: Re: Set difference in Hive Here is an example of what Amr mentioned from one of my Hive scripts, returns the set of pages not in "daily_pagecounts_table" select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON (dt.page_id = dp.page_id) where dp.page_id is NULL On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote: do an outer join on user and filter on name.user is null -- amr Rakesh Setty wrote: Hi, I am new to Hive. I would like to know what is the easiest way to get the difference between two sets. For example, how can I convert the following SQL query to Hive? select user from page_views where user not in (select name from users); Thanks, Rakesh -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
-
Re: Set difference in HiveRaghu Murthy 2009-06-30, 00:21
Do you care about getting duplicate 'user' values in the result? If not, you
could just get rid of duplicates before doing the join. select a.user from (select distinct user from page_views) a LEFT OUTER JOIN users b ON (a.user = b.user) WHERE b.user is NULL Note that the join runs out of memory in the reducer only if the number of duplicates is large, not just when the tables are large. On 6/29/09 5:11 PM, "Rakesh Setty" <[EMAIL PROTECTED]> wrote: > I tried this. Unfortunately, both tables are large. > > Thanks, > Rakesh > > > > From: Namit Jain [mailto:[EMAIL PROTECTED]] > Sent: Monday, June 29, 2009 5:05 PM > To: [EMAIL PROTECTED] > Subject: RE: Set difference in Hive > > The tables can be large - > > For a given key, have the table with the most number of values as the > rightmost table. > > The problem only happens when both the tables have keys with large number of > values. > > > Thanks, > -namit > > > > From: Rakesh Setty [mailto:[EMAIL PROTECTED]] > Sent: Monday, June 29, 2009 4:43 PM > To: [EMAIL PROTECTED] > Subject: RE: Set difference in Hive > > Thanks very much. But the reducer hangs with the warning WARN > org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than > joinEmitInterval rows for join key [] > Both the tables are large and as Zheng mentions at > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00640.html, large > size for table 0 is a problem. Is there any way to overcome this? > > Thanks, > Rakesh > > > > From: Peter Skomoroch [mailto:[EMAIL PROTECTED]] > Sent: Monday, June 29, 2009 4:20 PM > To: [EMAIL PROTECTED] > Subject: Re: Set difference in Hive > > Here is an example of what Amr mentioned from one of my Hive scripts, returns > the set of pages not in "daily_pagecounts_table" > > select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews > FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON > (dt.page_id = dp.page_id) > where dp.page_id is NULL > > On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]> wrote: > > > do an outer join on user and filter on name.user is null > > -- amr > > > > Rakesh Setty wrote: > > Hi, > > > > I am new to Hive. I would like to know what is the easiest way to > get the difference between two sets. For example, how can I convert the > following SQL query to Hive? > > > > select user from page_views where user not in (select name from users); > > > > Thanks, > > Rakesh > > >
-
RE: Set difference in HiveRakesh Setty 2009-06-30, 00:28
Thanks Raghu and Namit. This works.
Thanks, Rakesh -----Original Message----- From: Raghu Murthy [mailto:[EMAIL PROTECTED]] Sent: Monday, June 29, 2009 5:22 PM To: [EMAIL PROTECTED] Subject: Re: Set difference in Hive Do you care about getting duplicate 'user' values in the result? If not, you could just get rid of duplicates before doing the join. select a.user from (select distinct user from page_views) a LEFT OUTER JOIN users b ON (a.user = b.user) WHERE b.user is NULL Note that the join runs out of memory in the reducer only if the number of duplicates is large, not just when the tables are large. On 6/29/09 5:11 PM, "Rakesh Setty" <[EMAIL PROTECTED]> wrote: > I tried this. Unfortunately, both tables are large. > > Thanks, > Rakesh > > > > From: Namit Jain [mailto:[EMAIL PROTECTED]] > Sent: Monday, June 29, 2009 5:05 PM > To: [EMAIL PROTECTED] > Subject: RE: Set difference in Hive > > The tables can be large - > > For a given key, have the table with the most number of values as the > rightmost table. > > The problem only happens when both the tables have keys with large number of > values. > > > Thanks, > -namit > > > > From: Rakesh Setty [mailto:[EMAIL PROTECTED]] > Sent: Monday, June 29, 2009 4:43 PM > To: [EMAIL PROTECTED] > Subject: RE: Set difference in Hive > > Thanks very much. But the reducer hangs with the warning WARN > org.apache.hadoop.hive.ql.exec.JoinOperator: table 0 has more than > joinEmitInterval rows for join key [] > Both the tables are large and as Zheng mentions at > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00640.html, large > size for table 0 is a problem. Is there any way to overcome this? > > Thanks, > Rakesh > > > > From: Peter Skomoroch [mailto:[EMAIL PROTECTED]] > Sent: Monday, June 29, 2009 4:20 PM > To: [EMAIL PROTECTED] > Subject: Re: Set difference in Hive > > Here is an example of what Amr mentioned from one of my Hive scripts, returns > the set of pages not in "daily_pagecounts_table" > > select dt.page_id, dt.dates, dt.pageviews, dt.total_pageviews > FROM daily_timelines dt LEFT OUTER JOIN daily_pagecounts_table dp ON > (dt.page_id = dp.page_id) > where dp.page_id is NULL > > On Mon, Jun 29, 2009 at 7:14 PM, Amr Awadallah <[EMAIL PROTECTED]> wrote: > > > do an outer join on user and filter on name.user is null > > -- amr > > > > Rakesh Setty wrote: > > Hi, > > > > I am new to Hive. I would like to know what is the easiest way to > get the difference between two sets. For example, how can I convert the > following SQL query to Hive? > > > > select user from page_views where user not in (select name from users); > > > > Thanks, > > Rakesh > > > |