Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> When/how to use partitions and buckets usefully?


Copy link to this message
-
Re: When/how to use partitions and buckets usefully?
This operation is erroring out on the hive client itself before starting a
map so splitting to mappers is out of question.

can you do a dfs count for the members_map table hdfslocation and tell us
the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]>wrote:

> Hmm I must be doing something wrong,  the members_map table is 300ish MB.
> When I execute the following query:
>
> SELECT
>    /*+ MAPJOIN(members_map) */
>    date_int,
>    members_map.gender AS gender,
>    'generic',
>    COUNT( memberId ) AS unique,
>    SUM( `generic`['count'] ) AS count,
>    SUM( `generic`['seconds'] ) AS seconds
> FROM visit_stats
> JOIN members_map ON(members_map.member_id = visit_stats.memberId)
> GROUP BY date_int, members_map.gender
>
> It results in:
> 2012-04-24 10:25:59     Starting to launch local task to process map join;
>      maximum memory = 1864171520
> 2012-04-24 10:26:00     Processing rows:        200000          Hashtable
> size: 199999          Memory usage:   43501848        rate:   0.023
> 2012-04-24 10:30:54     Processing rows:        6900000 Hashtable size:
> 6899999 Memory usage:   1449867552      rate:   0.778
> 2012-04-24 10:31:02     Processing rows:        7000000 Hashtable size:
> 6999999 Memory usage:   1468378760      rate:   0.788
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>
>
> I'm running it only my local, single node, dev env, could that be a
> problem since it won't split over multiple mappers in this case?
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 24, 2012 9:47 AM
> To: [EMAIL PROTECTED]
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben
> Map join hint is provided to hive using "MAPJOIN" keyword as :
> SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
>
> To use map side join some hive configuration properties needs to be enabled
>
> For plain map side joins
> hive>SET hive.auto.convert.join=true;
> Latest versions of hive does a map join on the smaller table even if no
> map join hit is provided.
>
> For bucketed map joins
> hive>SET hive.optimize.bucketmapjoin=true
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>
>
> Regards
> Bejoy
>
>
> ________________________________
> From: Nitin Pawar <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, April 24, 2012 12:46 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If you are doing a map side join make sure the table members_map is small
> enough to hold in memory
>
> On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote:
> > Wow thanks everyone for the nice feedback!
> >
> > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */
> right?
> >
> >
> > Cheers,
> >
> > Ruben de Vries
> >
> > -----Original Message-----
> > From: Mark Grover [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, April 24, 2012 3:17 AM
> > To: [EMAIL PROTECTED]; bejoy ks
> > Cc: Ruben de Vries
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > Hi Ruben,
> > Like Bejoy pointed out, members_map is small enough to fit in memory,
> > so your joins with visit_stats would be much faster with map-side join.
> >
> > However, there is still some virtue in bucketing visit_stats.
> > Bucketing can optimize joins, group by's and potentially other queries
> > in certain circumstances.
> > You probably want to keep consistent bucketing columns across all your
> > tables so they can leveraged in multi-table queries. Most people use
> > some power of 2 as their number of buckets. To make the best use of
> > the buckets, each of your buckets should be able to entirely load into
> > memory on the node.
> >
> > I use something close the formula below to calculate the number of
> buckets:
> >
> > #buckets = (x * Average_partition_size) /
> > JVM_memory_available_to_your_Hadoop_tasknode
> >
> > I call x (>1) the "factor of conservatism". Higher x means you are

Nitin Pawar