Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - When/how to use partitions and buckets usefully?


Copy link to this message
-
Re: When/how to use partitions and buckets usefully?
Bejoy KS 2012-04-23, 16:39
If data is in hdfs, then you can bucket it only after loading into a temp/staging table and then to the final bucketed table. Bucketing needs a Map reduce job.
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Ruben de Vries <[EMAIL PROTECTED]>
Date: Mon, 23 Apr 2012 18:13:20
To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>; [EMAIL PROTECTED]<[EMAIL PROTECTED]>
Subject: RE: When/how to use partitions and buckets usefully?

Thanks for the help so far guys,

I bucketed the members_map, it's 330mb in size (11 mil records).

Can you manually bucket stuff?
Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA to import stuff into the visit_stats tables, replacing that with INSERT OVERWRITE SELECT slows it down a lot

From: Bejoy KS [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 23, 2012 6:06 PM
To: [EMAIL PROTECTED]
Subject: Re: When/how to use partitions and buckets usefully?

For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: "Bejoy KS" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
Date: Mon, 23 Apr 2012 16:03:34 +0000
To: <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
ReplyTo: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>
Subject: Re: When/how to use partitions and buckets usefully?

Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years?

The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement.
Regards
Bejoy KS

Sent from handheld, please excuse typos.
________________________________
From: Ruben de Vries <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
Date: Mon, 23 Apr 2012 17:38:20 +0200
To: [EMAIL PROTECTED]<[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]%[EMAIL PROTECTED]>>
ReplyTo: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>
Subject: RE: When/how to use partitions and buckets usefully?

Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;)

The member_map contains 11.636.619 records atm, I think bucketing those would be good?
What's a good number to bucket them by then?

And is there any point in bucketing the visit_stats?

From: Tucker, Matt [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]>
Sent: Monday, April 23, 2012 5:30 PM
To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>
Subject: RE: When/how to use partitions and buckets usefully?

If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution.

If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages.  If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins.

Matt Tucker

From: Ruben de Vries [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]>
Sent: Monday, April 23, 2012 11:19 AM
To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to setup and use partitions and buckets.
But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?!

In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age).

Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them?

Maybe some ppl have written good blogs on this subject but I can't really seem to find them!?

Any help would be appreciated, thanks in advance :)