Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Hive QL - NOT IN, NOT EXIST


+
Peter Chu 2013-05-06, 03:20
+
Michael Malak 2013-05-06, 03:25
+
Peter Chu 2013-05-06, 03:46
Copy link to this message
-
Re: Hive QL - NOT IN, NOT EXIST
Stephen Boesch 2013-05-06, 04:44
@Peter  Does the query plan demonstrate that the 3Meg row table is being
map-joined and the 400M table streamed through? That is what you want: but
you might either need to fiddle with hints to get it to happen

Details:
    Read uuids s of feed into  in-memory map on all nodes (mapjoin)
    Stream the 400M message records through the in-memory maps, copying
id's from the "all feed uuids"  map to a  "matched feed uuid's map for
entries that have matches in the messages

     Note: this way the 400M rows are only read once on the cluster.

You can see whether hive can manage this or if you write a custom m/r job
to do it.

2013/5/5 Peter Chu <[EMAIL PROTECTED]>

> It works but it takes a very long time because the subqueries in NOT IN
> contains 400 million rows (the message table in the example) and the feed
> table contains 3 million rows.
>
> SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);
>
> > Date: Sun, 5 May 2013 20:25:15 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: Hive QL - NOT IN, NOT EXIST
> > To: [EMAIL PROTECTED]
>
> >
> >
> > --- On Sun, 5/5/13, Peter Chu <[EMAIL PROTECTED]> wrote:
> >
> > > I am wondering if there is any way to do this without resorting to
> > > using left outer join and finding nulls.
> >
> > I have found this to be an acceptable substitute. Is it not working for
> you?
> >
>
+
Peter Chu 2013-05-06, 05:19
+
Stephen Boesch 2013-05-06, 07:05
+
Peter Chu 2013-05-06, 17:56