Re: When/how to use partitions and buckets usefully?
This operation is erroring out on the Hive client itself before any map
task starts, so splitting the work across mappers is out of the question.

Can you do a dfs count on the members_map table's HDFS location and tell
us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote:

> Hmm, I must be doing something wrong; the members_map table is around 300 MB.
> When I execute the following query:
>
> SELECT
>    /*+ MAPJOIN(members_map) */
>    date_int,
>    members_map.gender AS gender,
>    'generic',
>    COUNT( memberId ) AS unique,
>    SUM( `generic`['count'] ) AS count,
>    SUM( `generic`['seconds'] ) AS seconds
> FROM visit_stats
> JOIN members_map ON(members_map.member_id = visit_stats.memberId)
> GROUP BY date_int, members_map.gender
>
> It results in:
> 2012-04-24 10:25:59     Starting to launch local task to process map join; maximum memory = 1864171520
> 2012-04-24 10:26:00     Processing rows: 200000     Hashtable size: 199999     Memory usage: 43501848       rate: 0.023
> 2012-04-24 10:30:54     Processing rows: 6900000    Hashtable size: 6899999    Memory usage: 1449867552     rate: 0.778
> 2012-04-24 10:31:02     Processing rows: 7000000    Hashtable size: 6999999    Memory usage: 1468378760     rate: 0.788
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>
>
> I'm running it on my local, single-node dev env; could that be a
> problem, since it won't split over multiple mappers in this case?
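The log above already contains enough to see why this fails. A rough back-of-the-envelope sketch of the numbers (plain arithmetic, not Hive internals):

```python
# Back-of-the-envelope check of the map-join OOM, using the figures
# reported by the Hive local task in the log above.

max_heap = 1_864_171_520      # "maximum memory" from the log
used_at_7m = 1_468_378_760    # memory usage after 7,000,000 hash-table rows

# The "rate" column in the log is simply usage / maximum memory.
rate = used_at_7m / max_heap
print(round(rate, 3))         # matches the 0.788 the log reports

# Average in-memory cost per hash-table entry so far.
bytes_per_row = used_at_7m / 7_000_000
print(round(bytes_per_row, 1))  # roughly 210 bytes per row

# Rows that could still fit before the heap is exhausted (ignoring GC
# overhead and any safety margin Hive applies): well under 2 million more.
remaining_rows = (max_heap - used_at_7m) / bytes_per_row
print(int(remaining_rows))
```

So even though the table is only ~300 MB on disk, the Java hash table costs roughly 200 bytes per row in memory, which is why the local task exhausts a ~1.8 GB heap long before loading the full table.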
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 24, 2012 9:47 AM
> To: [EMAIL PROTECTED]
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben
> The map join hint is provided to Hive using the "MAPJOIN" keyword, as:
> SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
>
> To use map side joins, some Hive configuration properties need to be enabled.
>
> For plain map side joins
> hive>SET hive.auto.convert.join=true;
> Recent versions of Hive do a map join on the smaller table even if no
> map join hint is provided.
>
> For bucketed map joins
> hive>SET hive.optimize.bucketmapjoin=true;
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>
>
> Regards
> Bejoy
>
>
> ________________________________
> From: Nitin Pawar <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, April 24, 2012 12:46 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If you are doing a map side join make sure the table members_map is small
> enough to hold in memory
>
> On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote:
> > Wow thanks everyone for the nice feedback!
> >
> > I can force a map-side join by doing /*+ STREAMTABLE(members_map) */,
> right?
> >
> >
> > Cheers,
> >
> > Ruben de Vries
> >
> > -----Original Message-----
> > From: Mark Grover [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, April 24, 2012 3:17 AM
> > To: [EMAIL PROTECTED]; bejoy ks
> > Cc: Ruben de Vries
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > Hi Ruben,
> > Like Bejoy pointed out, members_map is small enough to fit in memory,
> > so your joins with visit_stats would be much faster with map-side join.
> >
> > However, there is still some virtue in bucketing visit_stats.
> > Bucketing can optimize joins, group by's and potentially other queries
> > in certain circumstances.
> > You probably want to keep consistent bucketing columns across all your
> > tables so they can be leveraged in multi-table queries. Most people use
> > some power of 2 as their number of buckets. To make the best use of
> > the buckets, each bucket should be able to load entirely into
> > memory on the node.
> >
> > I use something close to the formula below to calculate the number of
> > buckets:
> >
> > #buckets = (x * Average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode
> >
> I call x (>1) the "factor of conservatism". Higher x means you are being more conservative.
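Mark's rule of thumb can be sketched as a tiny function; the partition and JVM sizes below are made-up illustrations, and rounding up to a power of two follows his earlier remark:

```python
import math

def num_buckets(avg_partition_bytes, task_jvm_bytes, x=2.0):
    """#buckets = (x * average_partition_size) / task JVM memory,
    rounded up to the next power of 2. x > 1 is the 'factor of
    conservatism': higher x leaves more memory headroom per bucket."""
    raw = x * avg_partition_bytes / task_jvm_bytes
    if raw <= 1:
        return 1
    return 2 ** math.ceil(math.log2(raw))

# Hypothetical sizes: 4 GB average partition, 1 GB of JVM memory per task.
print(num_buckets(4 << 30, 1 << 30))    # 8 buckets
# A tiny partition never needs more than one bucket.
print(num_buckets(100 << 20, 1 << 30))  # 1 bucket
```

With x = 2, each bucket targets at most half the task JVM's memory, so a whole bucket can be hashed in memory with room to spare.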

Nitin Pawar