Hive >> mail # user >> Why BucketJoinMap consume too much memory
Thread:
binhnt22 2012-03-31, 01:16
Bejoy Ks 2012-04-01, 19:35
Amit Sharma 2012-04-03, 17:36
Bejoy Ks 2012-04-05, 08:07
binhnt22 2012-04-05, 10:07

Re: Why BucketJoinMap consume too much memory
Can you try adding these settings?

set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I have tried bucketing with 1000 buckets on tables with more than 1TB of
data, and they do go through fine.
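
As a sketch of how these settings fit together (table names here are placeholders, not from the thread): note that hive.enforce.bucketing only affects how data is written, so it must be enabled before the bucketed tables are populated, not just at query time.

```sql
-- Hypothetical names; hive.enforce.bucketing matters at load time,
-- so the bucketed table must be (re)populated with it enabled.
set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

INSERT OVERWRITE TABLE my_bucketed_table
SELECT * FROM my_staging_table;
```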

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <[EMAIL PROTECTED]> wrote:

>  Hi Bejoy,
>
> Both my tables have 65m records (~1.8-1.9GB on Hadoop) and are bucketed on
> the 'calling' column into 10 buckets.
>
> As you said, Hive will load only 1 bucket, ~180-190MB, into memory. That's
> hardly enough to blow the heap (1.3GB).
>
> According to the wiki, I set:
>
>
>   set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>   set hive.optimize.bucketmapjoin = true;
>   set hive.optimize.bucketmapjoin.sortedmerge = true;
>
> And ran the following SQL:
>
> select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join
> ra_ocs_cdr_ggsn_synthetic b
> on (a.calling = b.calling) where a.total_volume <> b.total_volume;
>
> But it still created many hash tables and then threw a Java heap space error.
>
>
> Best regards,
> Nguyen Thanh Binh (Mr)
> Cell phone: (+84)98.226.0622
>
>
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 05, 2012 3:07 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Why BucketJoinMap consume too much memory
>
> Hi Amit
>
>
>       Sorry for the delayed response, I had a terrible schedule. AFAIK,
> there are no flags that would let you move the hash table creation,
> compression, and loading into tmp files away from the client node.
>       From my understanding, if you use a map-side join, the small table as
> a whole is converted into a hash table and compressed into a tmp file. Say
> your child JVM size is 1GB and this small table is 5GB: it would blow up
> your job if the map task tried to hold such a huge file in memory. Bucketed
> map join can help here: if the table is bucketed, say into 100 buckets, then
> each bucket may have around 50MB of data, i.e. one tmp file would be just
> less than 50MB. Here a mapper needs to load only the required buckets
> into memory and thus hardly runs into memory issues.
>     Also, on the client, the records are processed bucket by bucket and
> loaded into tmp files. So if your bucket size is larger than the heap
> size specified for your client, it will throw an out-of-memory error.
>
> Regards
> Bejoy KS
>
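The sizing argument above is simple division; a small sketch of the arithmetic (the 0.5 safety factor is an assumed fudge for hash-table overhead, not a Hive constant):

```python
def bucket_size_mb(table_size_mb, num_buckets):
    """Approximate per-bucket size, assuming the bucketing column's
    values are evenly distributed (skewed keys inflate some buckets)."""
    return table_size_mb / num_buckets

def fits_in_heap(table_size_mb, num_buckets, heap_mb, safety=0.5):
    """Rough check: one bucket's hash table should stay well under the
    task heap; 'safety' is an assumed overhead factor for illustration."""
    return bucket_size_mb(table_size_mb, num_buckets) <= heap_mb * safety

# binhnt22's case: ~1.9GB table, 10 buckets, 1.3GB heap
print(bucket_size_mb(1900, 10))            # 190.0 MB per bucket
print(fits_in_heap(1900, 10, 1300))        # True

# Bejoy's cautionary case: 1TB table, only 100 buckets, 1GB child JVM
print(fits_in_heap(1_000_000, 100, 1024))  # False: ~10GB per bucket
```

On these numbers alone a 190MB bucket should fit comfortably in a 1.3GB heap, which is why the thread goes on to look for a different cause.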
>    ------------------------------
>
> From: Amit Sharma <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Bejoy Ks <[EMAIL PROTECTED]>
> Sent: Tuesday, April 3, 2012 11:06 PM
> Subject: Re: Why BucketJoinMap consume too much memory
>
>
> I am experiencing similar behavior in my queries. All the conditions for
> bucketed map join are met, and the only difference in execution when I set
> the hive.optimize.bucketmapjoin flag to true is that instead of a single
> hash table, multiple hash tables are created. All the hash tables are still
> created on the client side and loaded into tmp files, which are then
> distributed to the mappers using the distributed cache.
>
> Can I find an example anywhere that shows the behavior of bucketed map
> join where it does not create the hash tables on the client itself? If
> so, is there a flag for it?
>
> Thanks,
> Amit
>
> On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>
>
> Hi
>     On a first look, it seems like a plain map join is happening in your
> case rather than a bucketed map join. The following conditions need to
> hold for bucketed map join to work:
> 1) Both tables are bucketed on the join columns
> 2) The number of buckets in each table should be a multiple of the other's
> 3) Ensure that the tables have enough buckets
>
> Note: If the data is large, say 1TB per table, and you have just a few
> buckets, say 100, each mapper may have to load 10GB. This would again run
> into memory issues.

Nitin Pawar
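
The conditions Bejoy lists can be sketched in DDL (hypothetical table names and a simplified schema; bucket counts chosen so one is a multiple of the other):

```sql
-- Both tables bucketed on the join column 'calling' (condition 1);
-- 20 is a multiple of 10 (condition 2).
CREATE TABLE cdr_a (calling STRING, total_volume BIGINT)
CLUSTERED BY (calling) INTO 10 BUCKETS;

CREATE TABLE cdr_b (calling STRING, total_volume BIGINT)
CLUSTERED BY (calling) INTO 20 BUCKETS;

set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(a) */ *
FROM cdr_a a JOIN cdr_b b ON (a.calling = b.calling)
WHERE a.total_volume <> b.total_volume;
```

With matching bucket counts, each mapper reads only the corresponding bucket(s) of the small table rather than the whole table.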
binhnt22 2012-04-05, 10:52
Nitin Pawar 2012-04-05, 11:33
Bejoy Ks 2012-04-05, 12:22
binhnt22 2012-04-06, 01:19
gemini alex 2012-04-06, 07:06
Bejoy Ks 2012-04-06, 16:33
binhnt22 2012-04-09, 03:12
Bejoy Ks 2012-04-09, 14:48
binhnt22 2012-04-10, 02:40
Bejoy Ks 2012-04-10, 15:43