NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
Hive >> mail # user >> Best Performance on Large Scale Join


Re: Best Performance on Large Scale Join
Perhaps you can first create a temp table that contains only the records that will match?  See the UNION ALL trick at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg01906.html
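
One possible reading of that suggestion, as a rough HiveQL sketch. This has not been checked against the linked message, which may describe a different variant. The table names matching_keys and users_matched are hypothetical, and the approach only pays off if a large share of keys in each table has no partner in the other:

```sql
-- Sketch only; matching_keys and users_matched are hypothetical names.
-- Step 1: find keys present in BOTH tables with a single UNION ALL + GROUP BY,
-- so only the key column is shuffled rather than full rows.
CREATE TABLE matching_keys AS
SELECT t.a
FROM (
  SELECT a, 1 AS in_users, 0 AS in_products FROM users
  UNION ALL
  SELECT a, 0 AS in_users, 1 AS in_products FROM products
) t
GROUP BY t.a
HAVING max(t.in_users) = 1 AND max(t.in_products) = 1;

-- Step 2: keep only the rows whose key survived step 1.
-- The same statement would be repeated for products.
CREATE TABLE users_matched AS
SELECT u.a, u.b
FROM users u
LEFT SEMI JOIN matching_keys k ON (u.a = k.a);
```

If nearly every key matches on both sides, the extra passes add cost without shrinking the final join, so it is worth checking the match rate on a sample first.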

________________________________
 From: Brad Ruderman <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, July 29, 2013 11:38 AM
Subject: Best Performance on Large Scale Join
 
Hi All-

I have 2 tables:

CREATE TABLE users (
  a bigint,
  b int
);

CREATE TABLE products (
  a bigint,
  c int
);

Each table has about 8 billion records (roughly 2k files, so roughly 2k mappers in total). I want to know the most performant way to run the following query:

SELECT u.b,
       p.c,
       count(*) as count
FROM users u
INNER JOIN products p
ON u.a = p.a
GROUP BY u.b, p.c;

Right now the reduce phase is killing me. Any suggestions on improving performance? Would a bucket map join be optimal here?

Thanks,
Brad
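
On the bucket map join question above, a minimal sketch of what a bucketed, sorted layout might look like, assuming a Hive version of that era (0.x/1.x) where hive.enforce.bucketing still exists. The table names users_bucketed and products_bucketed and the bucket count of 256 are illustrative assumptions, not values from the thread. With both sides at about 8 billion rows, the sort-merge variant is likely the relevant one, since a plain bucket map join needs each bucket of one side to fit in memory:

```sql
-- Sketch only; table names and the bucket count are assumptions.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Both tables bucketed and sorted on the join key, with the same bucket count.
CREATE TABLE users_bucketed (a BIGINT, b INT)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (a BIGINT, c INT)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

-- One-time cost: rewriting the data into buckets is itself a full shuffle.
INSERT OVERWRITE TABLE users_bucketed SELECT a, b FROM users;
INSERT OVERWRITE TABLE products_bucketed SELECT a, c FROM products;

-- Enable the bucket map join and its sort-merge variant.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT /*+ MAPJOIN(p) */ u.b, p.c, count(*) AS cnt
FROM users_bucketed u
JOIN products_bucketed p ON u.a = p.a
GROUP BY u.b, p.c;
```

This moves the join itself to the map side; the GROUP BY still needs a reduce stage, though map-side partial aggregation should keep that stage small if (b, c) has few distinct combinations.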