I do not quite understand what you mean by Stream, specifically "Stream the 400M message records through the in-memory maps".Can you please elaborate.
Also, can you use MAPJOIN on left outer join?
Date: Sun, 5 May 2013 21:44:37 -0700
Subject: Re: Hive QL - NOT IN, NOT EXIST
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
@Peter Does the query plan demonstrate that the 3Meg row table is being map-joined and the 400M table streamed through? That is what you want: but you might either need to fiddle with hints to get it to happen
Details: Read uuids s of feed into in-memory map on all nodes (mapjoin) Stream the 400M message records through the in-memory maps, copying id's from the "all feed uuids" map to a "matched feed uuid's map for entries that have matches in the messages
Note: this way the 400M rows are only read once on the cluster.
You can see whether hive can manage this or if you write a custom m/r job to do it.
2013/5/5 Peter Chu <[EMAIL PROTECTED]>
It works but it takes a very long time because the subqueries in NOT IN contains 400 million rows (the message table in the example) and the feed table contains 3 million rows.
SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);
> Date: Sun, 5 May 2013 20:25:15 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: Hive QL - NOT IN, NOT EXIST
> To: [EMAIL PROTECTED]
> --- On Sun, 5/5/13, Peter Chu <[EMAIL PROTECTED]> wrote:
> > I am wondering if there is any way to do this without resorting to
> > using left outer join and finding nulls.
> I have found this to be an acceptable substitute. Is it not working for you?