Re: (Self) Joins on NULLable columns takes forever
Was there a typo below (v1 -> e1)?

event_log v1 JOIN event_log e2 ON
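
If it was just a typo in the mail, the query as actually run would presumably have been (a guess at the correction, not confirmed in this thread):

    event_log e1 JOIN event_log e2 ON

so that the e1.* references in the ON clause resolve.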

On Mon, Aug 23, 2010 at 1:36 PM, Rajappa Iyer <[EMAIL PROTECTED]> wrote:

> Consider the following table (I've omitted things like additional columns
> and the serde specification since I think they are mostly irrelevant):
>
> CREATE TABLE event_log (visit_time bigint, visitor_id string, user_id
> string ...) PARTITIONED BY (dt string) ROW FORMAT ...;
>
> Where visitor_id is assigned to every visitor and user_id is only present
> for logged in users.  In other words, visitor_id is never NULL, but user_id
> can be.
>
> I want to find out, for a given day, how many visitors were also seen
> yesterday.  That is, I want the subset of visitor_id's which were also
> present in the previous day.
>
> Here's a query I wrote:
>
> SELECT count(1) FROM (
>       SELECT DISTINCT e1.visitor_id FROM
>             event_log v1 JOIN event_log e2 ON
>                     (e1.visitor_id = e2.visitor_id AND e1.dt='2010-08-22'
> AND e2.dt='2010-08-21')) event_log;
>
> Each partition contains about 350K records on average.
>
> This works, no problems.
>
> But if I issue the same query with visitor_id replaced by user_id, the
> Stage-1 reducer does not seem to make progress -- it remains stuck at
> around 68% across several runs.
>
> The last few lines from the log:
>
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 214000000 rows
> 2010-08-23 13:13:27,036 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 214000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 215000000 rows
> 2010-08-23 13:13:29,230 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 215000000 rows
> 2010-08-23 13:13:31,489 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 216000000 rows
> 2010-08-23 13:13:31,490 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 216000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 8 forwarding 217000000 rows
> 2010-08-23 13:13:33,680 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 217000000 rows
>
> The number of rows seems to keep increasing without bound.
>
> Adding the condition "e1.user_id IS NOT NULL" to the join clause helps
> considerably in that the job now finishes, but it is still not as fast as
> the visitor_id query.
>
> Any ideas on how to improve the speed of the query?
>
> This was on hive-0.5.0+20 from CDH3.
>
> Thanks,
> Raj
>
>
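A plausible explanation for the blow-up (not stated explicitly in this thread): every visitor who was not logged in has a NULL user_id, so in the reduce-side join all of those rows carry the same (NULL) key, land on a single reducer, and get paired against each other. To illustrate with made-up numbers: if roughly 200K of the ~350K rows on each day have a NULL user_id, that one reducer would have to work through on the order of 200,000 x 200,000 = 40 billion joined rows, which would be consistent with the log above showing the JoinOperator forwarding hundreds of millions of rows without finishing.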
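One way the query could be rewritten so that NULL user_ids never reach the join at all (a sketch only, not posted in this thread; the subquery aliases a, b and t are made up, and it assumes the partition filters can be pushed into the subqueries):

    SELECT count(1) FROM (
      SELECT DISTINCT a.user_id
      FROM (SELECT user_id FROM event_log
             WHERE dt = '2010-08-22' AND user_id IS NOT NULL) a   -- today's logged-in visitors
      JOIN (SELECT user_id FROM event_log
             WHERE dt = '2010-08-21' AND user_id IS NOT NULL) b   -- yesterday's logged-in visitors
        ON a.user_id = b.user_id
    ) t;

Filtering the NULLs inside the subqueries, rather than only in the ON clause, keeps those rows out of the shuffle entirely -- the same idea as the IS NOT NULL workaround mentioned above, applied one step earlier.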