Hive >> mail # user >> Best Performance on Large Scale Join


Brad Ruderman 2013-07-29, 17:38
Nitin Pawar 2013-07-29, 17:59
Brad Ruderman 2013-07-29, 18:37
Re: Best Performance on Large Scale Join
Perhaps you can first create a temp table that contains only the records that will match?  See the UNION ALL trick at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01906.html
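
One way to read the "temp table that contains only the records that will match" suggestion is a semi-join pre-filter, sketched below in HiveQL. This is not necessarily the UNION ALL trick from the linked message, which is not reproduced here. The staging table names (users_matched, products_matched) and the alias cnt are made up for illustration. Each pre-filter is itself a join over the full tables, so this is only likely to pay off when a large share of the keys have no match on the other side.

```sql
-- Hypothetical staging tables; names are illustrative, not from the thread.
-- Keep only the users rows whose key also appears in products.
CREATE TABLE users_matched AS
SELECT u.a, u.b
FROM users u
LEFT SEMI JOIN products p ON (u.a = p.a);

-- Keep only the products rows whose key also appears in users.
CREATE TABLE products_matched AS
SELECT p.a, p.c
FROM products p
LEFT SEMI JOIN users u ON (p.a = u.a);

-- Run the original aggregation over the reduced inputs.
SELECT u.b, p.c, count(*) AS cnt
FROM users_matched u
INNER JOIN products_matched p ON u.a = p.a
GROUP BY u.b, p.c;
```

An inner join already discards non-matching rows, so the result is unchanged; the possible gain is only in how much data the final join has to shuffle.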

________________________________
 From: Brad Ruderman <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
 
Hi All-

I have 2 tables:

CREATE TABLE users (
  a bigint,
  b int
);

CREATE TABLE products (
  a bigint,
  c int
);

Each table has about 8 billion records (roughly 2k files, so about 2k mappers in total). I want to know the most performant way to do the following query:

SELECT u.b,
       p.c,
       count(*) AS count
FROM users u
INNER JOIN products p
  ON u.a = p.a
GROUP BY u.b, p.c;

Right now the reduce phase is killing me. Any suggestions on improving performance? Would a bucket map join be optimal here?

Thanks,
Brad
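
For reference, the bucket map join asked about above could be set up roughly as follows. This is a sketch under assumptions, not a tested recommendation for this workload: the bucketed table names and the bucket count of 256 are placeholders, and the settings shown are the ones used by Hive releases of that era (newer versions enforce bucketing by default and can convert joins without the hint). Both tables are bucketed and sorted on the join key with the same bucket count, which also allows a sort-merge bucket join.

```sql
-- Make Hive honor the bucketing and sort spec when populating the tables.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Hypothetical bucketed copies; 256 buckets is an arbitrary placeholder.
CREATE TABLE users_bucketed (a bigint, b int)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (a bigint, c int)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

INSERT OVERWRITE TABLE users_bucketed SELECT a, b FROM users;
INSERT OVERWRITE TABLE products_bucketed SELECT a, c FROM products;

-- Enable bucket map join and its sort-merge variant.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- The join runs map-side, bucket against bucket; the GROUP BY still needs a reduce stage.
SELECT /*+ MAPJOIN(p) */ u.b, p.c, count(*) AS cnt
FROM users_bucketed u
INNER JOIN products_bucketed p ON u.a = p.a
GROUP BY u.b, p.c;
```

Populating the bucketed tables is a one-time full rewrite of both inputs, so this mainly makes sense if the join will be run repeatedly.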