Re: When/how to use partitions and buckets usefully?
Partitions are useful when your queries run on a subset of the whole data set, so the choice of partition column depends on your queries. One point to take care of is that every partition should hold enough data.
Partitioning takes effect when you filter on the partition column in the WHERE clause.
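
As a rough sketch of the two points above, using the visit_stats table from the question: the column name dt (chosen here instead of date), the MAP column name and type, and the date format are all assumptions made for illustration, not details from the original table.

```sql
-- Hypothetical layout: one partition (one HDFS directory) per day.
-- Column names and types other than member_id are assumed.
CREATE TABLE visit_stats (
  member_id INT,
  visit_info MAP<STRING, INT>
)
PARTITIONED BY (dt STRING);

-- The filter on the partition column lets Hive read only the
-- dt='2012-04-23' partition instead of scanning the whole table.
SELECT member_id, COUNT(*) AS visits
FROM visit_stats
WHERE dt = '2012-04-23'
GROUP BY member_id;
```

A query on this table with no filter on dt still reads every partition, so partitioning by day only helps if most queries restrict the date, and only if each day holds a reasonable amount of data.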

Buckets are good for sampling and for joins such as bucketed map joins.
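
A minimal sketch of both uses, assuming the two tables from the question are bucketed on the join key. The table names with the _bucketed suffix, the column types, and the bucket count of 32 are hypothetical; for a bucketed map join the bucket counts of the two tables need to be equal or multiples of each other.

```sql
-- Makes inserts honour the bucket definition on Hive versions
-- that do not enforce bucketing by default.
SET hive.enforce.bucketing = true;

CREATE TABLE member_map_bucketed (
  member_id INT,
  gender STRING,
  age INT
)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

CREATE TABLE visit_stats_bucketed (
  member_id INT,
  visit_info MAP<STRING, INT>
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (member_id) INTO 32 BUCKETS;

-- Bucketed map join: each mapper loads only the matching bucket
-- of the smaller table rather than the whole table.
SET hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(m) */ v.dt, m.gender, COUNT(*) AS visits
FROM visit_stats_bucketed v
JOIN member_map_bucketed m ON v.member_id = m.member_id
GROUP BY v.dt, m.gender;

-- Sampling: read one bucket out of 32 instead of the full table.
SELECT *
FROM member_map_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON member_id);
```

Both tables have to be populated through Hive inserts for the bucket files to match the definition; files loaded directly into the table directory are not re-bucketed.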

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Ruben de Vries <[EMAIL PROTECTED]>
Date: Mon, 23 Apr 2012 17:19:00
To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to set up and use partitions and buckets.
But I'm more interested in how to figure out when, and on which columns, you should be partitioning and bucketing to increase performance?!

In my case I have 2 tables: visit_stats (member_id, date and some MAP cols which give me info about the visits) and member_map (member_id, gender, age).

Usually I group by date and then one of the other cols, so I assume that partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them?

Maybe some ppl have written good blogs on this subject but I can't really seem to find them!?

Any help would be appreciated, thanks in advance :)