Hive user mailing list: Best Performance on Large Scale Join


Re: Best Performance on Large Scale Join
Brad,
What cluster capacity have you got?
How many unique values of a, b, and c do you have in each table individually?
Is there any chance you can partition the data? Are there any columns on which
you can create buckets (a rough sketch is below)?

I have done joins with 10 billion records in one table, but the other table
was significantly smaller, and I had a 1000-node cluster at my disposal.
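
If bucketing is an option, a rough sketch of what it could look like follows. The
table names and the 256-bucket count are only illustrative assumptions, and both
tables would have to be rewritten bucketed and sorted on the join key a:

-- Hypothetical bucketed copies of the two tables, clustered and sorted on the
-- join key a; the bucket count of 256 is only an example.
CREATE TABLE users_bucketed (
  a BIGINT,
  b INT
)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE products_bucketed (
  a BIGINT,
  c INT
)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

-- Have Hive enforce bucketing and sorting while loading the copies.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;
INSERT OVERWRITE TABLE users_bucketed SELECT a, b FROM users;
INSERT OVERWRITE TABLE products_bucketed SELECT a, c FROM products;

-- Enable the bucketed sort-merge map join before running the aggregation.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT u.b,
       p.c,
       count(*) AS count
FROM users_bucketed u
INNER JOIN products_bucketed p
  ON u.a = p.a
GROUP BY u.b, p.c;

Whether this actually beats the plain reduce-side join depends on the skew of a
and on how evenly the buckets fill, so it is worth testing on a sample first.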

On Mon, Jul 29, 2013 at 11:08 PM, Brad Ruderman <[EMAIL PROTECTED]> wrote:

> Hi All-
>
> I have 2 tables:
>
> CREATE TABLE users (
> a bigint,
> b int
> )
>
> CREATE TABLE products (
> a bigint,
> c int
> )
>
> Each table has about 8 billion records (roughly 2k files, so about 2k total
> mappers). I want to know the most performant way to do the following query:
>
> SELECT u.b,
>        p.c,
>        count(*) as count
> FROM users u
> INNER JOIN products p
> ON u.a = p.a
> GROUP BY u.b, p.c
>
> Right now the reducing is killing me. Any suggestions on improving
> performance? Would a mapbucket join be optimal here?
>
> Thanks,
> Brad
>

--
Nitin Pawar