Hive >> mail # user >> Best Performance on Large Scale Join


Brad Ruderman 2013-07-29, 17:38
Nitin Pawar 2013-07-29, 17:59
Brad Ruderman 2013-07-29, 18:37
Re: Best Performance on Large Scale Join
Perhaps you can first create a temp table that contains only the records that will match? See the UNION ALL trick at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01906.html
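
One possible reading of that suggestion, sketched in HiveQL. The linked message is not reproduced here, so the exact UNION ALL trick may differ; the table names matching_keys, users_matched and products_matched are made up for illustration. This is likely to help only if a small share of the keys appear in both tables:

```sql
-- Assumption: matching_keys is a hypothetical helper table holding
-- every key that occurs in both users and products.
CREATE TABLE matching_keys AS
SELECT a
FROM (
  SELECT a, 1 AS in_users, 0 AS in_products FROM users
  UNION ALL
  SELECT a, 0 AS in_users, 1 AS in_products FROM products
) t
GROUP BY a
HAVING max(in_users) = 1 AND max(in_products) = 1;

-- Keep only the rows whose key will find a partner in the join.
CREATE TABLE users_matched AS
SELECT u.a, u.b
FROM users u
LEFT SEMI JOIN matching_keys k ON (u.a = k.a);

CREATE TABLE products_matched AS
SELECT p.a, p.c
FROM products p
LEFT SEMI JOIN matching_keys k ON (p.a = k.a);
```

The original query would then run against users_matched and products_matched instead of the full tables.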

________________________________
 From: Brad Ruderman <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
 
Hi All-

I have 2 tables:

CREATE TABLE users (
  a bigint,
  b int
);

CREATE TABLE products (
  a bigint,
  c int
);

Each table has about 8 billion records (roughly 2k files in total, so roughly 2k mappers). I want to know the most performant way to run the following query:

SELECT u.b,
       p.c,
       count(*) AS count
FROM users u
INNER JOIN products p
  ON u.a = p.a
GROUP BY u.b, p.c;

Right now the reduce phase is killing me. Any suggestions on improving performance? Would a bucket map join be optimal here?

Thanks,
Brad
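
For reference, a rough sketch of what the bucketed join asked about above might look like in HiveQL. Everything beyond the two original tables is an assumption: the names users_bucketed and products_bucketed are hypothetical, the bucket count of 512 is a placeholder, and the settings are those of Hive releases from that period (hive.enforce.bucketing and hive.enforce.sorting were removed in later versions). A plain bucket map join needs each bucket of one side to fit in mapper memory, which seems unlikely with 8 billion rows per table, so the sketch uses the sort-merge variant:

```sql
-- Make inserts honor the bucketing and sorting declared on the target table.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Both tables are bucketed and sorted on the join key with the same
-- bucket count (512 is a placeholder, not a recommendation).
CREATE TABLE users_bucketed (a bigint, b int)
CLUSTERED BY (a) SORTED BY (a) INTO 512 BUCKETS;

CREATE TABLE products_bucketed (a bigint, c int)
CLUSTERED BY (a) SORTED BY (a) INTO 512 BUCKETS;

INSERT OVERWRITE TABLE users_bucketed SELECT a, b FROM users;
INSERT OVERWRITE TABLE products_bucketed SELECT a, c FROM products;

-- Enable the sort-merge bucket map join for the query below.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT /*+ MAPJOIN(p) */ u.b, p.c, count(*) AS cnt
FROM users_bucketed u
JOIN products_bucketed p ON u.a = p.a
GROUP BY u.b, p.c;
```

This only moves the join to the map side; the GROUP BY still needs a reduce phase, and populating the bucketed tables is itself a full shuffle, so it pays off mainly when the bucketed tables are reused.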