Hive >> mail # user >> Best Performance on Large Scale Join


Brad Ruderman 2013-07-29, 17:38
Re: Best Performance on Large Scale Join
Brad,
What's the cluster capacity you've got?
How many unique values of a, b, and c do you have individually in each
table?
Is there any chance you can partition the data? Are there any columns
on which you can create buckets?

I have done joins with 10 billion records in one table, but the other table
was significantly smaller, and I had a 1000-node cluster at my disposal.
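
Bucketing both tables on the join key, as suggested above, could look roughly like this in Hive. This is a sketch only: the `_bucketed` table names are hypothetical, and the bucket count of 256 is an assumption, not a tuned value.

```sql
-- Recreate both tables bucketed (and sorted) on the join key `a`.
CREATE TABLE users_bucketed (
  a BIGINT,
  b INT
)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (
  a BIGINT,
  c INT
)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

-- Populate from the existing tables; hive.enforce.bucketing makes Hive
-- hash rows into the declared buckets on insert (needed on Hive of this era).
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE users_bucketed SELECT a, b FROM users;
INSERT OVERWRITE TABLE products_bucketed SELECT a, c FROM products;
```

The bucket count would normally be chosen from data volume and cluster size; matching (or multiple) bucket counts on both sides is what makes a bucketed join possible later.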

On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman <[EMAIL PROTECTED]>wrote:

> Hi All-
>
> I have 2 tables:
>
> CREATE TABLE users (
> a bigint,
> b int
> )
>
> CREATE TABLE products (
> a bigint,
> c int
> )
>
> Each table has about 8 billion records (roughly 2k files total mappers). I
> want to know the most performant way to do the following query:
>
> SELECT u.b,
>        p.c,
>        count(*) as count
> FROM users u
> INNER JOIN products p
> ON u.a = p.a
> GROUP BY u.b, p.c
>
> Right now the reducing is killing me. Any suggestions on improving
> performance? Would a mapbucket join be optimal here?
>
> Thanks,
> Brad
>

--
Nitin Pawar
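
The bucketed map join Brad asks about could be sketched as follows, assuming both tables have been rewritten bucketed and sorted on `a` (the `users_bucketed`/`products_bucketed` names are hypothetical stand-ins for such tables):

```sql
-- With both sides bucketed and sorted on `a` and compatible bucket counts,
-- these settings let Hive attempt a sort-merge-bucket (SMB) map join,
-- avoiding the shuffle a plain reduce-side join requires.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT u.b,
       p.c,
       count(*) AS count
FROM users_bucketed u
INNER JOIN products_bucketed p
  ON u.a = p.a
GROUP BY u.b, p.c;
```

Whether Hive actually picks the SMB plan depends on the bucket metadata matching; `EXPLAIN` on the query would show whether the merge join was applied.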
Brad Ruderman 2013-07-29, 18:37
Michael Malak 2013-07-29, 18:04