Hive >> mail # user >> Re: When/how to use partitions and buckets usefully?


Re: When/how to use partitions and buckets usefully?
If you are doing a map-side join, make sure the table members_map is
small enough to hold in memory.

On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote:
> Wow thanks everyone for the nice feedback!
>
> I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right?
>
>
> Cheers,
>
> Ruben de Vries
>
> -----Original Message-----
> From: Mark Grover [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 24, 2012 3:17 AM
> To: [EMAIL PROTECTED]; bejoy ks
> Cc: Ruben de Vries
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben,
> Like Bejoy pointed out, members_map is small enough to fit in memory, so
> your joins with visit_stats would be much faster with map-side join.
>
> However, there is still some virtue in bucketing visit_stats. Bucketing can
> optimize joins, group by's and potentially other queries in certain
> circumstances.
> You probably want to keep consistent bucketing columns across all your
> tables so they can be leveraged in multi-table queries. Most people use some
> power of 2 as their number of buckets. To make the best use of the buckets,
> each of your buckets should be able to load entirely into memory on the
> node.
>
> I use something close to the formula below to calculate the number of buckets:
>
> #buckets = (x * Average_partition_size) /
> JVM_memory_available_to_your_Hadoop_tasknode
>
> I call x (>1) the "factor of conservatism". A higher x means you are being
> more conservative by having a larger number of buckets (and bearing the
> increased overhead); a lower x means the reverse. What x to use depends
> on your use case, because the number of buckets in a table is fixed.
> If you have a large partition, it will distribute its data into bulkier
> buckets, and you would want to make sure these bulkier buckets can still fit
> in memory. Moreover, buckets are generated using a hashing function, so if you
> have a strong bias towards a particular value of the bucketing column in your
> data, some buckets might be bulkier than others. In that case, you'd want to
> make sure that those bulkier buckets can still fit in memory.
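
The skew effect described in the paragraph above can be illustrated with a small simulation. This is only a sketch: it assumes integer keys are assigned to buckets by `key % num_buckets`, which is a simplification of whatever hashing the engine really applies, and the function name and sample data are made up for the example.

```python
from collections import Counter


def bucket_row_counts(keys, num_buckets):
    """Count rows per bucket, assuming bucket = key % num_buckets (a simplification)."""
    counts = Counter(key % num_buckets for key in keys)
    # Return a dense list so that empty buckets show up as 0.
    return [counts.get(bucket, 0) for bucket in range(num_buckets)]


# Hypothetical data: 1000 evenly spread keys vs. 1000 keys where 60% share one value.
uniform_keys = list(range(1000))
skewed_keys = [7] * 600 + list(range(400))

uniform_counts = bucket_row_counts(uniform_keys, 4)
skewed_counts = bucket_row_counts(skewed_keys, 4)

print("uniform:", uniform_counts)
print("skewed: ", skewed_counts)
```

With the skewed sample, the bucket that receives the dominant key ends up several times larger than the others, and that largest bucket is the one that has to fit in memory.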
>
> To summarize, it depends on:
> * How the actual partition sizes vary from the average partition size (i.e.
> the standard deviation of your partition size). A larger standard deviation
> means you should be more conservative in your calculation and vice versa.
> * Distribution of the data in the bucketing columns. "Wider" distribution
> means you should be more conservative and vice-versa.
>
> Long story short, I would say an x of 2 to 4 should suffice in most cases, but
> feel free to verify that in your case :-) I would love to hear what factors
> others have been using when calculating their number of buckets, BTW!
> Whatever answer you get for #buckets from the above formula, use the closest
> power of 2 as the number of buckets in your table (I am not sure if this is
> a must, though).
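
As a rough worked example of the rule of thumb quoted above, the formula and the power-of-2 rounding might be sketched as follows. The function names, the 10 GiB partition size, and the 1 GiB of task memory are assumptions chosen purely for illustration; they do not come from the thread.

```python
import math


def estimate_num_buckets(avg_partition_bytes, task_jvm_bytes, x=2.0):
    """Raw estimate: x * average partition size / JVM memory available to a task."""
    if avg_partition_bytes <= 0 or task_jvm_bytes <= 0:
        raise ValueError("sizes must be positive")
    if x <= 1:
        raise ValueError("the quoted message defines x as greater than 1")
    return x * avg_partition_bytes / task_jvm_bytes


def closest_power_of_two(value):
    """Return the power of two closest to value (at least 1); ties go to the larger one."""
    if value <= 1:
        return 1
    lower = 2 ** math.floor(math.log2(value))
    upper = lower * 2
    return lower if (value - lower) < (upper - value) else upper


GIB = 2 ** 30
# Hypothetical inputs: 10 GiB average partition, 1 GiB of task JVM memory, x = 3.
raw_estimate = estimate_num_buckets(10 * GIB, 1 * GIB, x=3)
num_buckets = closest_power_of_two(raw_estimate)
print(raw_estimate, num_buckets)
```

Under these assumed inputs the raw estimate is 30, which rounds to 32 buckets. Whether rounding to a power of 2 is actually required is left open in the message itself.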
>
> Good luck!
>
> Mark
>
> Mark Grover, Business Intelligence Analyst OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: [EMAIL PROTECTED]
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Bejoy KS" <[EMAIL PROTECTED]>
> To: "Ruben de Vries" <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
> Sent: Monday, April 23, 2012 12:39:17 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If the data is in HDFS, then you can bucket it only by first loading it into a
> temp/staging table and then inserting into the final bucketed table. Bucketing needs a
> MapReduce job.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> From: Ruben de Vries <[EMAIL PROTECTED]>
> Date: Mon, 23 Apr 2012 18:13:20 +0200
> To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>;
> [EMAIL PROTECTED]<[EMAIL PROTECTED]>
> Subject: RE: When/how to use partitions and buckets usefully?
>
>
>
>
> Thanks for the help so far guys,
>
>
>
> I bucketed the members_map, it’s 330mb in size (11 mil records).
Nitin Pawar