Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Hive QL - NOT IN, NOT EXIST


Copy link to this message
-
RE: Hive QL - NOT IN, NOT EXIST
Thanks Stephen,
Will start a cluster today to see if it helps.
Peter

Date: Mon, 6 May 2013 00:05:45 -0700
Subject: Re: Hive QL - NOT IN, NOT EXIST
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Hi Peter,   Looks like mapjoin does not work with outer join so streamtable is instead a possible approach. You would stream the larger table through the smaller one:

 can you see whether the following helps your perf issue?
select /*+ streamtable(message) */ f.uuid  from message m right outer join feed f on m.uuid = f.uuid where m.uuid = null;

2013/5/5 Peter Chu <[EMAIL PROTECTED]>
Thanks, Stephen,
I do not quite understand what you mean by Stream, specifically "Stream the 400M message records through the in-memory maps".Can you please elaborate.

Also, can you use MAPJOIN on left outer join?
Peter

Date: Sun, 5 May 2013 21:44:37 -0700
Subject: Re: Hive QL - NOT IN, NOT EXIST

From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

@Peter  Does the query plan demonstrate that the 3Meg row table is being map-joined and the 400M table streamed through? That is what you want: but you might either need to fiddle with hints to get it to happen

Details:    Read uuids s of feed into  in-memory map on all nodes (mapjoin)     Stream the 400M message records through the in-memory maps, copying id's from the "all feed uuids"  map to a  "matched feed uuid's map for entries that have matches in the messages
     Note: this way the 400M rows are only read once on the cluster.
You can see whether hive can manage this or if you write a custom m/r job to do it.
 

2013/5/5 Peter Chu <[EMAIL PROTECTED]>

It works but it takes a very long time because the subqueries in NOT IN contains 400 million rows (the message table in the example) and the feed table contains 3 million rows.
SELECT uuid from feed f WHERE f.uuid NOT IN (SELECT uuid FROM message);
> Date: Sun, 5 May 2013 20:25:15 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: Hive QL - NOT IN, NOT EXIST
> To: [EMAIL PROTECTED]
>
>
> --- On Sun, 5/5/13, Peter Chu <[EMAIL PROTECTED]> wrote:
>
> > I am wondering if there is any way to do this without resorting to
> > using left outer join and finding nulls.
>
> I have found this to be an acceptable substitute.  Is it not working for you?
>
     

     

     
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB