Hive user mailing list: Why BucketJoinMap consume too much memory


binhnt22 2012-03-31, 01:16
Bejoy Ks 2012-04-01, 19:35
Amit Sharma 2012-04-03, 17:36
Bejoy Ks 2012-04-05, 08:07
binhnt22 2012-04-05, 10:07
Re: Why BucketJoinMap consume too much memory
can you try adding these settings?
set hive.enforce.bucketing=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I have tried bucketing with 1000 buckets on tables of more than 1 TB of data,
and they do go through fine.
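
For context, a minimal sketch of how hive.enforce.bucketing comes into play when the bucketed tables are populated (the table, column, and staging names below are hypothetical, not taken from this thread):

  -- table bucketed on the join column, as bucketed map join requires
  create table cdr_bucketed (calling string, total_volume bigint)
    clustered by (calling) into 1000 buckets;

  -- without this flag, an insert is not forced to produce one file per bucket
  set hive.enforce.bucketing=true;
  insert overwrite table cdr_bucketed
  select calling, total_volume from cdr_staging;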

On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <[EMAIL PROTECTED]> wrote:

>  Hi Bejoy,
>
> Both my tables have 65 million records (~1.8-1.9 GB on Hadoop) and are
> bucketed on the 'calling' column into 10 buckets.
>
> As you said, Hive will load only one bucket (~180-190 MB) into memory.
> That should hardly blow the heap (1.3 GB).
>
> According to the wiki, I set:
>
>   set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>   set hive.optimize.bucketmapjoin = true;
>   set hive.optimize.bucketmapjoin.sortedmerge = true;
>
> And ran the following SQL:
>
> select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join
> ra_ocs_cdr_ggsn_synthetic b
> on (a.calling = b.calling) where a.total_volume <> b.total_volume;
>
> But it still created many hash tables and then threw a Java heap space error.
>
> Best regards
> Nguyen Thanh Binh (Mr)
> Cell phone: (+84)98.226.0622
>
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 05, 2012 3:07 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Why BucketJoinMap consume too much memory
>
> Hi Amit
>
>       Sorry for the delayed response, had a terrible schedule. AFAIK, there
> are no flags that would let you move the hash table creation, compression,
> and loading into tmp files away from the client node.
>       From my understanding, if you use a map-side join, the small table as
> a whole is converted into a hash table and compressed into a tmp file. Say
> your child JVM size is 1 GB and this small table is 5 GB: it would blow up
> your job if a map task tried to pull such a huge file into memory. Bucketed
> map join can help here; if the table is bucketed, say into 100 buckets,
> then each bucket may hold around 50 MB of data, i.e. one tmp file would be
> just less than 50 MB. A mapper then needs to load only the required buckets
> into memory and thus hardly runs into memory issues.
>     Also, on the client, the records are processed bucket by bucket and
> loaded into tmp files. So if your bucket size is larger than the heap size
> specified for your client, it will throw an out-of-memory error.
>
> Regards
> Bejoy KS
>
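
To make the sizing above concrete, a rough sketch using those numbers (a ~5 GB small table and a 1 GB child JVM; the table and column names are made up for illustration):

  -- ~5 GB of data spread over 100 buckets is roughly 50 MB per bucket,
  -- so the hash table built from one bucket fits easily in a 1 GB child JVM,
  -- whereas a single hash table of the whole 5 GB table would not
  create table small_side (calling string, total_volume bigint)
    clustered by (calling) into 100 buckets;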
>    ------------------------------
>
> From: Amit Sharma <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Bejoy Ks <[EMAIL PROTECTED]>
> Sent: Tuesday, April 3, 2012 11:06 PM
> Subject: Re: Why BucketJoinMap consume too much memory
>
>
> I am experiencing similar behavior in my queries. All the conditions for
> bucketed map join are met, and the only difference in execution when I set
> the hive.optimize.bucketmapjoin flag to true is that instead of a single
> hash table, multiple hash tables are created. All the hash tables are
> still created on the client side and loaded into tmp files, which are then
> distributed to the mappers using the distributed cache.
>
> Can I find an example anywhere that shows the behavior of bucketed map
> join where it does not create the hash tables on the client itself? If so,
> is there a flag for it?
>
> Thanks,
> Amit
>
> On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>
> Hi
>     On a first look, it seems like a plain map join is happening in your
> case rather than a bucketed map join. The following conditions need to
> hold for a bucketed map join to work:
> 1) Both tables are bucketed on the join columns
> 2) The number of buckets in each table should be multiples of each other
> 3) Ensure that the tables have enough buckets
>
> Note: If the data is large, say 1 TB per table, and you have just a few
> buckets, say 100 buckets, each mapper may have to load 10 GB. This would

Nitin Pawar
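
A minimal sketch of table definitions and settings that satisfy the three conditions quoted above (table names, column types, and bucket counts are illustrative only, not taken from this thread):

  -- both tables bucketed on the join column; 16 and 32 are multiples of
  -- each other, and the counts are chosen so each bucket stays small
  create table cdr_a (calling string, total_volume bigint)
    clustered by (calling) into 16 buckets;
  create table cdr_b (calling string, total_volume bigint)
    clustered by (calling) into 32 buckets;

  set hive.optimize.bucketmapjoin=true;
  select /*+ MAPJOIN(a) */ *
  from cdr_a a join cdr_b b on (a.calling = b.calling);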
binhnt22 2012-04-05, 10:52
Nitin Pawar 2012-04-05, 11:33
Bejoy Ks 2012-04-05, 12:22
binhnt22 2012-04-06, 01:19
gemini alex 2012-04-06, 07:06
Bejoy Ks 2012-04-06, 16:33
binhnt22 2012-04-09, 03:12
Bejoy Ks 2012-04-09, 14:48
binhnt22 2012-04-10, 02:40
Bejoy Ks 2012-04-10, 15:43