Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Re: When/how to use partitions and buckets usefully?


+
Bejoy KS 2012-04-23, 16:05
+
Ruben de Vries 2012-04-23, 16:13
+
Bejoy KS 2012-04-23, 16:39
+
Mark Grover 2012-04-24, 01:16
+
Ruben de Vries 2012-04-24, 06:58
+
Nitin Pawar 2012-04-24, 07:16
+
Bejoy Ks 2012-04-24, 07:46
+
Ruben de Vries 2012-04-24, 08:36
Copy link to this message
-
Re: When/how to use partitions and buckets usefully?
This operation is erroring out on the hive client itself before starting a
map so splitting to mappers is out of question.

can you do a dfs count for the members_map table hdfslocation and tell us
the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]>wrote:

> Hmm I must be doing something wrong,  the members_map table is 300ish MB.
> When I execute the following query:
>
> SELECT
>    /*+ MAPJOIN(members_map) */
>    date_int,
>    members_map.gender AS gender,
>    'generic',
>    COUNT( memberId ) AS unique,
>    SUM( `generic`['count'] ) AS count,
>    SUM( `generic`['seconds'] ) AS seconds
> FROM visit_stats
> JOIN members_map ON(members_map.member_id = visit_stats.memberId)
> GROUP BY date_int, members_map.gender
>
> It results in:
> 2012-04-24 10:25:59     Starting to launch local task to process map join;
>      maximum memory = 1864171520
> 2012-04-24 10:26:00     Processing rows:        200000          Hashtable
> size: 199999          Memory usage:   43501848        rate:   0.023
> 2012-04-24 10:30:54     Processing rows:        6900000 Hashtable size:
> 6899999 Memory usage:   1449867552      rate:   0.778
> 2012-04-24 10:31:02     Processing rows:        7000000 Hashtable size:
> 6999999 Memory usage:   1468378760      rate:   0.788
> Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
>
>
> I'm running it only my local, single node, dev env, could that be a
> problem since it won't split over multiple mappers in this case?
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 24, 2012 9:47 AM
> To: [EMAIL PROTECTED]
> Subject: Re: When/how to use partitions and buckets usefully?
>
> Hi Ruben
> Map join hint is provided to hive using "MAPJOIN" keyword as :
> SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
>
> To use map side join some hive configuration properties needs to be enabled
>
> For plain map side joins
> hive>SET hive.auto.convert.join=true;
> Latest versions of hive does a map join on the smaller table even if no
> map join hit is provided.
>
> For bucketed map joins
> hive>SET hive.optimize.bucketmapjoin=true
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>
>
> Regards
> Bejoy
>
>
> ________________________________
> From: Nitin Pawar <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, April 24, 2012 12:46 PM
> Subject: Re: When/how to use partitions and buckets usefully?
>
> If you are doing a map side join make sure the table members_map is small
> enough to hold in memory
>
> On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote:
> > Wow thanks everyone for the nice feedback!
> >
> > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */
> right?
> >
> >
> > Cheers,
> >
> > Ruben de Vries
> >
> > -----Original Message-----
> > From: Mark Grover [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, April 24, 2012 3:17 AM
> > To: [EMAIL PROTECTED]; bejoy ks
> > Cc: Ruben de Vries
> > Subject: Re: When/how to use partitions and buckets usefully?
> >
> > Hi Ruben,
> > Like Bejoy pointed out, members_map is small enough to fit in memory,
> > so your joins with visit_stats would be much faster with map-side join.
> >
> > However, there is still some virtue in bucketing visit_stats.
> > Bucketing can optimize joins, group by's and potentially other queries
> > in certain circumstances.
> > You probably want to keep consistent bucketing columns across all your
> > tables so they can leveraged in multi-table queries. Most people use
> > some power of 2 as their number of buckets. To make the best use of
> > the buckets, each of your buckets should be able to entirely load into
> > memory on the node.
> >
> > I use something close the formula below to calculate the number of
> buckets:
> >
> > #buckets = (x * Average_partition_size) /
> > JVM_memory_available_to_your_Hadoop_tasknode
> >
> > I call x (>1) the "factor of conservatism". Higher x means you are

Nitin Pawar
+
Edward Capriolo 2012-04-23, 16:09
+
Ruben de Vries 2012-04-23, 15:19
+
Bejoy KS 2012-04-23, 15:31
+
Tucker, Matt 2012-04-23, 15:30
+
Ruben de Vries 2012-04-23, 15:38
+
Bejoy KS 2012-04-23, 16:03
+
Ruben de Vries 2012-04-24, 11:28
+
Bejoy Ks 2012-04-24, 13:57
+
Ruben de Vries 2012-04-24, 16:09
+
gemini alex 2012-04-25, 07:36
+
gemini alex 2012-04-25, 07:40
+
Ruben de Vries 2012-04-25, 07:48
+
Mark Grover 2012-04-26, 00:59
+
gemini alex 2012-04-26, 03:37
+
Ruben de Vries 2012-04-26, 07:16
+
Ruben de Vries 2012-04-26, 09:06
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB