-Re: (Self) Joins on NULLable columns takes forever
Ted Yu 2010-08-23, 22:42
Was there a typo below (v1 -> e1) ?
event_log v1 JOIN event_log e2 ON
On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <[EMAIL PROTECTED]> wrote:
> Consider the following table (I've omitted things like additional columns
> and the serde specification since I think they are mostly irrelevant):
> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id
> string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
> Where visitor_id is assigned to every visitor and user_id is only present
> for logged in users. In other words, visitor_id is never NULL, but user_id
> can be.
> I want to find out, for a given day, how many visitors were also seen
> yesterday. That is, I want the subset of visitor_id's which were also
> present in the previous day.
> Here's a query I wrote:
> SELECT count(1) FROM (
> SELECT DISTINCT e1.visitor_id FROM
> event_log v1 JOIN event_log e2 ON
> (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22'
> AND e2.dt='2010-08-21')) event_log;
> The average number of total records for each partition is about 350K.
> This works, no problems.
> But if I issue the same query with visitor_id replaced by user_id, the
> Stage-1 reducer does not seem to make progress -- it remains stuck around
> say 68% or so in several runs.
> The last few lines from the log:
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows
> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows
> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows
> The number of rows keep on increasing without bound it seems.
> Adding the condition "e1.user_id IS NOT NULL" to the join condition helps
> matters considerably in that the job finishes but it is still not as fast as
> the query for visitor_id.
> Any ideas on how to improve the speed of the query?
> This was on hive-0.5.0+20 from CDH3.