Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Hive QL - NOT IN, NOT EXIST


+
Peter Chu 2013-05-06, 03:20
+
Michael Malak 2013-05-06, 03:25
+
Peter Chu 2013-05-06, 03:46
+
Stephen Boesch 2013-05-06, 04:44
+
Peter Chu 2013-05-06, 05:19
Copy link to this message
-
Re: Hive QL - NOT IN, NOT EXIST
Hi Peter,
   Looks like mapjoin does not work with outer join so streamtable is
instead a possible approach. You would stream the larger table through the
smaller one:

 can you see whether the following helps your perf issue?

select /*+ streamtable(message) */ f.uuid  from message m right outer join
feed f on m.uuid = f.uuid where m.uuid = null;
2013/5/5 Peter Chu <[EMAIL PROTECTED]>

> Thanks, Stephen,
>
> I do not quite understand what you mean by Stream, specifically "Stream
> the 400M message records through the in-memory maps".
> Can you please elaborate.
>
> Also, can you use MAPJOIN on left outer join?
>
> Peter
>
> ------------------------------
> Date: Sun, 5 May 2013 21:44:37 -0700
>
> Subject: Re: Hive QL - NOT IN, NOT EXIST
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
>
>
> @Peter  Does the query plan demonstrate that the 3Meg row table is being
> map-joined and the 400M table streamed through? That is what you want: but
> you might either need to fiddle with hints to get it to happen
>
> Details:
>     Read uuids s of feed into  in-memory map on all nodes (mapjoin)
>     Stream the 400M message records through the in-memory maps, copying
> id's from the "all feed uuids"  map to a  "matched feed uuid's map for
> entries that have matches in the messages
>
>      Note: this way the 400M rows are only read once on the cluster.
>
> You can see whether hive can manage this or if you write a custom m/r job
> to do it.
>
>
>
> 2013/5/5 Peter Chu <[EMAIL PROTECTED]>
>
> It works but it takes a very long time because the subqueries in NOT IN
> contains 400 million rows (the message table in the example) and the feed
> table contains 3 million rows.
>
> SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);
>
> > Date: Sun, 5 May 2013 20:25:15 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: Hive QL - NOT IN, NOT EXIST
> > To: [EMAIL PROTECTED]
>
> >
> >
> > --- On Sun, 5/5/13, Peter Chu <[EMAIL PROTECTED]> wrote:
> >
> > > I am wondering if there is any way to do this without resorting to
> > > using left outer join and finding nulls.
> >
> > I have found this to be an acceptable substitute. Is it not working for
> you?
> >
>
>
>
+
Peter Chu 2013-05-06, 17:56
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB