|
|
-
(Self) Joins on NULLable columns takes forever
Rajappa Iyer 2010-08-23, 20:36
Consider the following table (I've omitted things like additional columns and the serde specification since I think they are mostly irrelevant):
CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
Where visitor_id is assigned to every visitor and user_id is only present for logged in users. In other words, visitor_id is never NULL, but user_id can be.
I want to find out, for a given day, how many visitors were also seen yesterday. That is, I want the subset of visitor_id's which were also present in the previous day.
Here's a query I wrote:
SELECT count(1) FROM ( SELECT DISTINCT e1.visitor_id FROM event_log v1 JOIN event_log e2 ON (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22' AND e2.dt='2010-08-21')) event_log;
The average number of total records for each partition is about 350K.
This works, no problems.
But if I issue the same query with visitor_id replaced by user_id, the Stage-1 reducer does not seem to make progress -- it remains stuck around say 68% or so in several runs.
The last few lines from the log:
2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows
The number of rows keep on increasing without bound it seems.
Adding the condition "e1.user_id IS NOT NULL" to the join condition helps matters considerably in that the job finishes but it is still not as fast as the query for visitor_id.
Any ideas on how to improve the speed of the query?
This was on hive-0.5.0+20 from CDH3.
Thanks, Raj
-
Re: (Self) Joins on NULLable columns takes forever
Ted Yu 2010-08-23, 22:42
Was there a typo below (v1 -> e1) ?
event_log v1 JOIN event_log e2 ON
On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <[EMAIL PROTECTED]> wrote:
> Consider the following table (I've omitted things like additional columns > and the serde specification since I think they are mostly irrelevant): > > CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id > string ...) PARTITIONED BY (dt string) ROW FORMAT ...; > > Where visitor_id is assigned to every visitor and user_id is only present > for logged in users. In other words, visitor_id is never NULL, but user_id > can be. > > I want to find out, for a given day, how many visitors were also seen > yesterday. That is, I want the subset of visitor_id's which were also > present in the previous day. > > Here's a query I wrote: > > SELECT count(1) FROM ( > SELECT DISTINCT e1.visitor_id FROM > event_log v1 JOIN event_log e2 ON > (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22' > AND e2.dt='2010-08-21')) event_log; > > The average number of total records for each partition is about 350K. > > This works, no problems. > > But if I issue the same query with visitor_id replaced by user_id, the > Stage-1 reducer does not seem to make progress -- it remains stuck around > say 68% or so in several runs. > > The last few lines from the log: > > 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows > 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows > 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows > 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows > 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows > 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows > 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows > 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows > > The number of rows keep on increasing without bound it seems. > > Adding the condition "e1.user_id IS NOT NULL" to the join condition helps > matters considerably in that the job finishes but it is still not as fast as > the query for visitor_id. > > Any ideas on how to improve the speed of the query? > > This was on hive-0.5.0+20 from CDH3. > > Thanks, > Raj > >
-
Re: (Self) Joins on NULLable columns takes forever
Rajappa Iyer 2010-08-23, 22:52
Yep, that was a typo... sorry. It should read "event_log e1"
Thanks, Raj
On Mon, Aug 23, 2010 at 3:42 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Was there a typo below (v1 -> e1) ? > > > event_log v1 JOIN event_log e2 ON > > On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <[EMAIL PROTECTED]> wrote: > >> Consider the following table (I've omitted things like additional columns >> and the serde specification since I think they are mostly irrelevant): >> >> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id >> string ...) PARTITIONED BY (dt string) ROW FORMAT ...; >> >> Where visitor_id is assigned to every visitor and user_id is only present >> for logged in users. In other words, visitor_id is never NULL, but user_id >> can be. >> >> I want to find out, for a given day, how many visitors were also seen >> yesterday. That is, I want the subset of visitor_id's which were also >> present in the previous day. >> >> Here's a query I wrote: >> >> SELECT count(1) FROM ( >> SELECT DISTINCT e1.visitor_id FROM >> event_log v1 JOIN event_log e2 ON >> (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22' >> AND e2.dt='2010-08-21')) event_log; >> >> The average number of total records for each partition is about 350K. >> >> This works, no problems. >> >> But if I issue the same query with visitor_id replaced by user_id, the >> Stage-1 reducer does not seem to make progress -- it remains stuck around >> say 68% or so in several runs. >> >> The last few lines from the log: >> >> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows >> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows >> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows >> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows >> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows >> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows >> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows >> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows >> >> The number of rows keep on increasing without bound it seems. >> >> Adding the condition "e1.user_id IS NOT NULL" to the join condition helps >> matters considerably in that the job finishes but it is still not as fast as >> the query for visitor_id. >> >> Any ideas on how to improve the speed of the query? >> >> This was on hive-0.5.0+20 from CDH3. >> >> Thanks, >> Raj >> >> >
-
Re: (Self) Joins on NULLable columns takes forever
Rajappa Iyer 2010-08-23, 23:32
Sure. I did three runs each.
The times for the visitor_id query were 124.515, 118.673 and 115.33 seconds
for user_id query with "e1.user_id is not null": 241.201, 252.091 and 238.144 seconds
The slowness seems to be mainly due to Stage-1 and Stage-2 reduce.
FYI, the row counts are as follows:
2010-08-22, total 295465, with user_id NOT NULL: 230813 2010-08-21, total 267236, with user_id NOT NULL: 213522
Hope this helps.
Raj
On Mon, Aug 23, 2010 at 3:59 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Can you publish run time statistics for user_id query with the condition > "e1.user_id IS NOT NULL" ? > > This would allow us to see how much slower that query is compared to the > visitor_id query. > > Thanks > > > On Mon, Aug 23, 2010 at 3:52 PM, Rajappa Iyer <[EMAIL PROTECTED]> wrote: > >> Yep, that was a typo... sorry. It should read "event_log e1" >> >> Thanks, >> Raj >> >> >> On Mon, Aug 23, 2010 at 3:42 PM, Ted Yu <[EMAIL PROTECTED]> wrote: >> >>> Was there a typo below (v1 -> e1) ? >>> >>> >>> event_log v1 JOIN event_log e2 ON >>> >>> On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <[EMAIL PROTECTED]> wrote: >>> >>>> Consider the following table (I've omitted things like additional >>>> columns and the serde specification since I think they are mostly >>>> irrelevant): >>>> >>>> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id >>>> string ...) PARTITIONED BY (dt string) ROW FORMAT ...; >>>> >>>> Where visitor_id is assigned to every visitor and user_id is only >>>> present for logged in users. In other words, visitor_id is never NULL, but >>>> user_id can be. >>>> >>>> I want to find out, for a given day, how many visitors were also seen >>>> yesterday. That is, I want the subset of visitor_id's which were also >>>> present in the previous day. >>>> >>>> Here's a query I wrote: >>>> >>>> SELECT count(1) FROM ( >>>> SELECT DISTINCT e1.visitor_id FROM >>>> event_log v1 JOIN event_log e2 ON >>>> (e1.visitor_id = e2.visitor_id AND >>>> e1.dt='2010-08-22' AND e2.dt='2010-08-21')) event_log; >>>> >>>> The average number of total records for each partition is about 350K. >>>> >>>> This works, no problems. >>>> >>>> But if I issue the same query with visitor_id replaced by user_id, the >>>> Stage-1 reducer does not seem to make progress -- it remains stuck around >>>> say 68% or so in several runs. >>>> >>>> The last few lines from the log: >>>> >>>> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows >>>> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows >>>> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows >>>> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows >>>> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows >>>> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows >>>> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows >>>> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows >>>> >>>> The number of rows keep on increasing without bound it seems. >>>> >>>> Adding the condition "e1.user_id IS NOT NULL" to the join condition >>>> helps matters considerably in that the job finishes but it is still not as >>>> fast as the query for visitor_id. >>>> >>>> Any ideas on how to improve the speed of the query? >>>> >>>> This was on hive-0.5.0+20 from CDH3. >>>> >>>> Thanks, >>>> Raj >>>> >>>> >>> >> >
|
|