|
|
-
Join with greater/less then condition
sonia gehlot 2012-07-05, 19:21
Hi Guys,
I want to join 2 tables in hive on couple of columns and out them one condition is timestamp of one column is greater then the other one. In SQL I could have written in this way:
table_a a Join table_b b on a.user_id = b.user_id and a.title_id = b.title_id and a.timestamp > b.timestamp
How to write last condition in Pig? *a.timestamp > b.timestamp*
Thanks, Sonia
+
sonia gehlot 2012-07-05, 19:21
-
Re: Join with greater/less then condition
Alan Gates 2012-07-05, 19:43
Pig can only do equi-joins. Theta joins are hard in MapReduce. So the way to do this is do the equi-join and then filter afterwards. This will not create significant additional cost since the join results will be filtered before being materialized to disk.
C = Join table_a on user_id, title_id, table_b on user_id, title_id; D = filter C by table_a::timestamp > table_b::timestamp;
Alan.
On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote:
> Hi Guys, > > I want to join 2 tables in hive on couple of columns and out them one > condition is timestamp of one column is greater then the other one. In SQL > I could have written in this way: > > table_a a Join table_b b > on a.user_id = b.user_id > and a.title_id = b.title_id > and a.timestamp > b.timestamp > > How to write last condition in Pig? *a.timestamp > b.timestamp* > > Thanks, > Sonia
+
Alan Gates 2012-07-05, 19:43
-
Re: Join with greater/less then condition
sonia gehlot 2012-07-05, 21:28
Actually I wanted to do left outer join, so not sure if filter will work in this case. On Thu, Jul 5, 2012 at 12:43 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> Pig can only do equi-joins. Theta joins are hard in MapReduce. So the > way to do this is do the equi-join and then filter afterwards. This will > not create significant additional cost since the join results will be > filtered before being materialized to disk. > > C = Join table_a on user_id, title_id, table_b on user_id, title_id; > D = filter C by table_a::timestamp > table_b::timestamp; > > Alan. > > On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote: > > > Hi Guys, > > > > I want to join 2 tables in hive on couple of columns and out them one > > condition is timestamp of one column is greater then the other one. In > SQL > > I could have written in this way: > > > > table_a a Join table_b b > > on a.user_id = b.user_id > > and a.title_id = b.title_id > > and a.timestamp > b.timestamp > > > > How to write last condition in Pig? *a.timestamp > b.timestamp* > > > > Thanks, > > Sonia > >
+
sonia gehlot 2012-07-05, 21:28
-
Re: Join with greater/less then condition
Dmitriy Ryaboy 2012-07-06, 15:09
Replace the filter with a foreach / generate (replace the right side with nulls of > condition is not satisfied)
On Thu, Jul 5, 2012 at 2:28 PM, sonia gehlot <[EMAIL PROTECTED]> wrote: > Actually I wanted to do left outer join, so not sure if filter will work in > this case. > > > On Thu, Jul 5, 2012 at 12:43 PM, Alan Gates <[EMAIL PROTECTED]> wrote: > >> Pig can only do equi-joins. Theta joins are hard in MapReduce. So the >> way to do this is do the equi-join and then filter afterwards. This will >> not create significant additional cost since the join results will be >> filtered before being materialized to disk. >> >> C = Join table_a on user_id, title_id, table_b on user_id, title_id; >> D = filter C by table_a::timestamp > table_b::timestamp; >> >> Alan. >> >> On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote: >> >> > Hi Guys, >> > >> > I want to join 2 tables in hive on couple of columns and out them one >> > condition is timestamp of one column is greater then the other one. In >> SQL >> > I could have written in this way: >> > >> > table_a a Join table_b b >> > on a.user_id = b.user_id >> > and a.title_id = b.title_id >> > and a.timestamp > b.timestamp >> > >> > How to write last condition in Pig? *a.timestamp > b.timestamp* >> > >> > Thanks, >> > Sonia >> >>
+
Dmitriy Ryaboy 2012-07-06, 15:09
|
|