|
binhnt22
2012-03-31, 01:16
Bejoy Ks
2012-04-01, 19:35
Amit Sharma
2012-04-03, 17:36
Bejoy Ks
2012-04-05, 08:07
binhnt22
2012-04-05, 10:07
Nitin Pawar
2012-04-05, 10:36
binhnt22
2012-04-05, 10:52
Nitin Pawar
2012-04-05, 11:33
Bejoy Ks
2012-04-05, 12:22
binhnt22
2012-04-06, 01:19
gemini alex
2012-04-06, 07:06
Bejoy Ks
2012-04-06, 16:33
binhnt22
2012-04-09, 03:12
Bejoy Ks
2012-04-09, 14:48
binhnt22
2012-04-10, 02:40
Bejoy Ks
2012-04-10, 15:43
|
-
Why BucketJoinMap consume too much memorybinhnt22 2012-03-31, 01:16
I have 2 table, each has 6 million records and clustered into 10 buckets
These tables are very simple with 1 key column and 1 value column, all I want is getting the key that exists in both table but different value. The normal did the trick, took only 141 secs. select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; I tried to use bucket join map by setting: set hive.optimize.bucketmapjoin = true select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; 2012-03-30 11:35:09 Starting to launch local task to process map join; maximum memory = 1398145024 2012-03-30 11:35:12 Processing rows: 200000 Hashtable size: 199999 Memory usage: 86646704 rate: 0.062 2012-03-30 11:35:15 Processing rows: 300000 Hashtable size: 299999 Memory usage: 128247464 rate: 0.092 2012-03-30 11:35:18 Processing rows: 400000 Hashtable size: 399999 Memory usage: 174041744 rate: 0.124 2012-03-30 11:35:21 Processing rows: 500000 Hashtable size: 499999 Memory usage: 214140840 rate: 0.153 2012-03-30 11:35:25 Processing rows: 600000 Hashtable size: 599999 Memory usage: 255181504 rate: 0.183 2012-03-30 11:35:29 Processing rows: 700000 Hashtable size: 699999 Memory usage: 296744320 rate: 0.212 2012-03-30 11:35:35 Processing rows: 800000 Hashtable size: 799999 Memory usage: 342538616 rate: 0.245 2012-03-30 11:35:38 Processing rows: 900000 Hashtable size: 899999 Memory usage: 384138552 rate: 0.275 2012-03-30 11:35:45 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 425719576 rate: 0.304 2012-03-30 11:35:50 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 467319576 rate: 0.334 2012-03-30 11:35:56 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 508940504 rate: 0.364 2012-03-30 11:36:04 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 550521128 rate: 0.394 2012-03-30 11:36:09 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 592121128 rate: 0.424 2012-03-30 11:36:15 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 633720336 rate: 0.453 2012-03-30 11:36:22 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 692097568 rate: 0.495 2012-03-30 11:36:33 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 725308944 rate: 0.519 2012-03-30 11:36:40 Processing rows: 1800000 Hashtable size: 1799999 Memory usage: 766946424 rate: 0.549 2012-03-30 11:36:48 Processing rows: 1900000 Hashtable size: 1899999 Memory usage: 808527928 rate: 0.578 2012-03-30 11:36:55 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 850127928 rate: 0.608 2012-03-30 11:37:08 Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 891708856 rate: 0.638 2012-03-30 11:37:16 Processing rows: 2200000 Hashtable size: 2199999 Memory usage: 933308856 rate: 0.668 2012-03-30 11:37:25 Processing rows: 2300000 Hashtable size: 2299999 Memory usage: 974908856 rate: 0.697 2012-03-30 11:37:34 Processing rows: 2400000 Hashtable size: 2399999 Memory usage: 1016529448 rate: 0.727 2012-03-30 11:37:43 Processing rows: 2500000 Hashtable size: 2499999 Memory usage: 1058129496 rate: 0.757 2012-03-30 11:37:58 Processing rows: 2600000 Hashtable size: 2599999 Memory usage: 1099708832 rate: 0.787 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space My system has 4 PC, each has CPU E2180, 2GB ram, 80GB HDD, one of them containts NameNode, JobTracker, Hive Server and all of them contain DataNode, TaskTracker In all node, I set: export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~ 1.3GB heap) I want to ask you experts, why bucket join map consume too much memory? Am I wrong or my configuration is bad? Best regards,
-
Re: Why BucketJoinMap consume too much memoryBejoy Ks 2012-04-01, 19:35
Hi
On a first look, it seems like map join is happening in your case other than bucketed map join. The following conditions need to hold for bucketed map join to work 1) Both the tables are bucketed on the join columns 2) The number of buckets in each table should be multiples of each other 3) Ensure that the table has enough number of buckets Note: If the data is large say 1TB(per table) and if you have just a few buckets say 100 buckets, each mapper may have to load 10GB>. This would definitely blow your jvm . Bottom line is ensure your mappers are not heavily loaded with the bucketed data distribution. Regards Bejoy.K.S ________________________________ From: binhnt22 <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Saturday, March 31, 2012 6:46 AM Subject: Why BucketJoinMap consume too much memory I have 2 table, each has 6 million records and clustered into 10 buckets These tables are very simple with 1 key column and 1 value column, all I want is getting the key that exists in both table but different value. The normal did the trick, took only 141 secs. select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; I tried to use bucket join map by setting: set hive.optimize.bucketmapjoin = true select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling b.calling) where a.total_volume <> b.total_volume; 2012-03-30 11:35:09 Starting to launch local task to process map join; maximum memory = 1398145024 2012-03-30 11:35:12 Processing rows: 200000 Hashtable size: 199999 Memory usage: 86646704 rate: 0.062 2012-03-30 11:35:15 Processing rows: 300000 Hashtable size: 299999 Memory usage: 128247464 rate: 0.092 2012-03-30 11:35:18 Processing rows: 400000 Hashtable size: 399999 Memory usage: 174041744 rate: 0.124 2012-03-30 11:35:21 Processing rows: 500000 Hashtable size: 499999 Memory usage: 214140840 rate: 0.153 2012-03-30 11:35:25 Processing rows: 600000 Hashtable size: 599999 Memory usage: 255181504 rate: 0.183 2012-03-30 11:35:29 Processing rows: 700000 Hashtable size: 699999 Memory usage: 296744320 rate: 0.212 2012-03-30 11:35:35 Processing rows: 800000 Hashtable size: 799999 Memory usage: 342538616 rate: 0.245 2012-03-30 11:35:38 Processing rows: 900000 Hashtable size: 899999 Memory usage: 384138552 rate: 0.275 2012-03-30 11:35:45 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 425719576 rate: 0.304 2012-03-30 11:35:50 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 467319576 rate: 0.334 2012-03-30 11:35:56 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 508940504 rate: 0.364 2012-03-30 11:36:04 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 550521128 rate: 0.394 2012-03-30 11:36:09 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 592121128 rate: 0.424 2012-03-30 11:36:15 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 633720336 rate: 0.453 2012-03-30 11:36:22 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 692097568 rate: 0.495 2012-03-30 11:36:33 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 725308944 rate: 0.519 2012-03-30 11:36:40 Processing rows: 1800000 Hashtable size: 1799999 Memory usage: 766946424 rate: 0.549 2012-03-30 11:36:48 Processing rows: 1900000 Hashtable size: 1899999 Memory usage: 808527928 rate: 0.578 2012-03-30 11:36:55 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 850127928 rate: 0.608 2012-03-30 11:37:08 Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 891708856 rate: 0.638 2012-03-30 11:37:16 Processing rows: 2200000 Hashtable size: 2199999 Memory usage: 933308856 rate: 0.668 2012-03-30 11:37:25 Processing rows: 2300000 Hashtable size: 2299999 Memory usage: 974908856 rate: 0.697 2012-03-30 11:37:34 Processing rows: 2400000 Hashtable size: 2399999 Memory usage: 1016529448 rate: 0.727 2012-03-30 11:37:43 Processing rows: 2500000 Hashtable size: 2499999 Memory usage: 1058129496 rate: 0.757 2012-03-30 11:37:58 Processing rows: 2600000 Hashtable size: 2599999 Memory usage: 1099708832 rate: 0.787 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space My system has 4 PC, each has CPU E2180, 2GB ram, 80GB HDD, one of them containts NameNode, JobTracker, Hive Server and all of them contain DataNode, TaskTracker In all node, I set: export HADOOP_HEAPSIZE=1500 in hadoop-env.sh (~ 1.3GB heap) I want to ask you experts, why bucket join map consume too much memory? Am I wrong or my configuration is bad? Best regards,
-
Re: Why BucketJoinMap consume too much memoryAmit Sharma 2012-04-03, 17:36
I am experiencing similar behavior in my queries. All the conditions for
bucketed map join are met, and the only difference in execution when i set the hive.optimize.bucketmapjoin flag to true, is that instead of a single hash table, multiple hash tables are created. All the Hash Tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using distributed cache. Can i find any example anywhere, which shows behavior of bucketed map join, where in it does not create the has tables on the client itself? If so, is there a flag for it? Thanks, Amit On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote: > Hi > On a first look, it seems like map join is happening in your case > other than bucketed map join. The following conditions need to hold for > bucketed map join to work > 1) Both the tables are bucketed on the join columns > 2) The number of buckets in each table should be multiples of each other > 3) Ensure that the table has enough number of buckets > > Note: If the data is large say 1TB(per table) and if you have just a few > buckets say 100 buckets, each mapper may have to load 10GB>. This would > definitely blow your jvm . Bottom line is ensure your mappers are not > heavily loaded with the bucketed data distribution. > > Regards > Bejoy.K.S > ------------------------------ > *From:* binhnt22 <[EMAIL PROTECTED]> > *To:* [EMAIL PROTECTED] > *Sent:* Saturday, March 31, 2012 6:46 AM > *Subject:* Why BucketJoinMap consume too much memory > > I have 2 table, each has 6 million records and clustered into 10 > buckets > > > > These tables are very simple with 1 key column and 1 value column, all I > want is getting the key that exists in both table but different value. > > > > The normal did the trick, took only 141 secs. > > > > select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b > on (a.calling = b.calling) where a.total_volume <> b.total_volume; > > > > I tried to use bucket join map by setting: *set > hive.optimize.bucketmapjoin = true* > > > > select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join > ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where > a.total_volume <> b.total_volume; > > > > 2012-03-30 11:35:09 Starting to launch local task to process map > join; maximum memory = 1398145024 > > 2012-03-30 11:35:12 Processing rows: 200000 Hashtable size: > 199999 Memory usage: 86646704 rate: 0.062 > > 2012-03-30 11:35:15 Processing rows: 300000 Hashtable size: > 299999 Memory usage: 128247464 rate: 0.092 > > 2012-03-30 11:35:18 Processing rows: 400000 Hashtable size: > 399999 Memory usage: 174041744 rate: 0.124 > > 2012-03-30 11:35:21 Processing rows: 500000 Hashtable size: > 499999 Memory usage: 214140840 rate: 0.153 > > 2012-03-30 11:35:25 Processing rows: 600000 Hashtable size: > 599999 Memory usage: 255181504 rate: 0.183 > > 2012-03-30 11:35:29 Processing rows: 700000 Hashtable size: > 699999 Memory usage: 296744320 rate: 0.212 > > 2012-03-30 11:35:35 Processing rows: 800000 Hashtable size: > 799999 Memory usage: 342538616 rate: 0.245 > > 2012-03-30 11:35:38 Processing rows: 900000 Hashtable size: > 899999 Memory usage: 384138552 rate: 0.275 > > 2012-03-30 11:35:45 Processing rows: 1000000 Hashtable size: > 999999 Memory usage: 425719576 rate: 0.304 > > 2012-03-30 11:35:50 Processing rows: 1100000 Hashtable size: > 1099999 Memory usage: 467319576 rate: 0.334 > > 2012-03-30 11:35:56 Processing rows: 1200000 Hashtable size: > 1199999 Memory usage: 508940504 rate: 0.364 > > 2012-03-30 11:36:04 Processing rows: 1300000 Hashtable size: > 1299999 Memory usage: 550521128 rate: 0.394 > > 2012-03-30 11:36:09 Processing rows: 1400000 Hashtable size:
-
Re: Why BucketJoinMap consume too much memoryBejoy Ks 2012-04-05, 08:07
Hi Amit
Sorry for the delayed response, had a terrible schedule. AFAIK, there is no flags that would help you to take the hash table creation, compression and load into tmp files away from client node. From my understanding if you use a Map side join, the small table as a whole is converted into a hash table and compressed in a tmp file. Say if your child jvm size is 1gb and this small table is 5GB, it'd blow off jour job if the map tasks tries to get such a huge file in memory. Bucketed map join can help here, if the table is bucketed ,say 100 buckets then each bucket may have around 50mb of data. ie one tmp file would be just less that 50mb, here mapper needs to load only the required buckets in memory and thus hardly run into memory issues. Also on the client, The records are processed bucket by bucket and loaded into tmp files. So if your bucket size is too large, than the heap size specified for your client, it'd throw an out of memory. Regards Bejoy KS ________________________________ From: Amit Sharma <[EMAIL PROTECTED]> To: [EMAIL PROTECTED]; Bejoy Ks <[EMAIL PROTECTED]> Sent: Tuesday, April 3, 2012 11:06 PM Subject: Re: Why BucketJoinMap consume too much memory I am experiencing similar behavior in my queries. All the conditions for bucketed map join are met, and the only difference in execution when i set the hive.optimize.bucketmapjoin flag to true, is that instead of a single hash table, multiple hash tables are created. All the Hash Tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using distributed cache. Can i find any example anywhere, which shows behavior of bucketed map join, where in it does not create the has tables on the client itself? If so, is there a flag for it? Thanks, Amit On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote: Hi > On a first look, it seems like map join is happening in your case other than bucketed map join. The following conditions need to hold for bucketed map join to work >1) Both the tables are bucketed on the join columns >2) The number of buckets in each table should be multiples of each other >3) Ensure that the table has enough number of buckets > >Note: If the data is large say 1TB(per table) and if you have just a few buckets say 100 buckets, each mapper may have to load 10GB>. This would definitely blow your jvm . Bottom line is ensure your mappers are not heavily loaded with the bucketed data distribution. > >Regards >Bejoy.K.S > > >________________________________ > From: binhnt22 <[EMAIL PROTECTED]> >To: [EMAIL PROTECTED] >Sent: Saturday, March 31, 2012 6:46 AM >Subject: Why BucketJoinMap consume too much memory > > > >I have 2 table, each has 6 million records and clustered into 10 buckets > >These tables are very simple with 1 key column and 1 value column, all I want is getting the key that exists in both table but different value. > >The normal did the trick, took only 141 secs. > >select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; > >I tried to use bucket join map by setting: set hive.optimize.bucketmapjoin = true > >select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling b.calling) where a.total_volume <> b.total_volume; > >2012-03-30 11:35:09 Starting to launch local task to process map join; maximum memory = 1398145024 >2012-03-30 11:35:12 Processing rows: 200000 Hashtable size: 199999 Memory usage: 86646704 rate: 0.062 >2012-03-30 11:35:15 Processing rows: 300000 Hashtable size: 299999 Memory usage: 128247464 rate: 0.092 >2012-03-30 11:35:18 Processing rows: 400000 Hashtable size: 399999 Memory usage: 174041744 rate: 0.124 >2012-03-30 11:35:21 Processing rows: 500000 Hashtable size: 499999 Memory usage: 214140840 rate: 0.153 11:35:25 Processing rows: 600000 Hashtable size: 599999 Memory usage: 255181504 rate: 0.183 11:35:29 Processing rows: 700000 Hashtable size: 699999 Memory usage: 296744320 rate: 0.212 11:35:35 Processing rows: 800000 Hashtable size: 799999 Memory usage: 342538616 rate: 0.245 11:35:38 Processing rows: 900000 Hashtable size: 899999 Memory usage: 384138552 rate: 0.275 11:35:45 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 425719576 rate: 0.304 11:35:50 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 467319576 rate: 0.334 11:35:56 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 508940504 rate: 0.364 11:36:04 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 550521128 rate: 0.394 11:36:09 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 592121128 rate: 0.424 11:36:15 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 633720336 rate: 0.453 11:36:22 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 692097568 rate: 0.495 11:36:33 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 725308944 rate: 0.519 11:36:40 Processing rows: 1800000 Hashtable size: 1799999 Memory usage: 766946424 rate: 0.549 11:36:48 Processing rows: 1900000 Hashtable size: 1899999 Memory usage: 808527928 rate: 0.578 11:36:55 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 850127928 rate: 0.608 11:37:08 Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 891708856 rate: 0.638 11:37:16 Processing rows: 2200000 Hashtable size: 2199999 Memory usage: 933308856 rate: 0.6
-
RE: Why BucketJoinMap consume too much memorybinhnt22 2012-04-05, 10:07
Hi Bejoy,
Both my tables has 65m records ( ~ 1.8-1.9GB on hadoop) and bucketized on 'calling' column into 10 buckets. As you said, hive will load only 1 bucket ~ 180-190MB into memory. That's hardly to blow the heap (1.3GB) According to wiki, I set: set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; And run the following SQL select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; But it still created many hash tables then threw Java Heap space error Best regards Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 05, 2012 3:07 PM To: [EMAIL PROTECTED] Subject: Re: Why BucketJoinMap consume too much memory Hi Amit Sorry for the delayed response, had a terrible schedule. AFAIK, there is no flags that would help you to take the hash table creation, compression and load into tmp files away from client node. From my understanding if you use a Map side join, the small table as a whole is converted into a hash table and compressed in a tmp file. Say if your child jvm size is 1gb and this small table is 5GB, it'd blow off jour job if the map tasks tries to get such a huge file in memory. Bucketed map join can help here, if the table is bucketed ,say 100 buckets then each bucket may have around 50mb of data. ie one tmp file would be just less that 50mb, here mapper needs to load only the required buckets in memory and thus hardly run into memory issues. Also on the client, The records are processed bucket by bucket and loaded into tmp files. So if your bucket size is too large, than the heap size specified for your client, it'd throw an out of memory. Regards Bejoy KS _____ From: Amit Sharma <[EMAIL PROTECTED]> To: [EMAIL PROTECTED]; Bejoy Ks <[EMAIL PROTECTED]> Sent: Tuesday, April 3, 2012 11:06 PM Subject: Re: Why BucketJoinMap consume too much memory I am experiencing similar behavior in my queries. All the conditions for bucketed map join are met, and the only difference in execution when i set the hive.optimize.bucketmapjoin flag to true, is that instead of a single hash table, multiple hash tables are created. All the Hash Tables are still created on the client side and loaded into tmp files, which are then distributed to the mappers using distributed cache. Can i find any example anywhere, which shows behavior of bucketed map join, where in it does not create the has tables on the client itself? If so, is there a flag for it? Thanks, Amit On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote: Hi On a first look, it seems like map join is happening in your case other than bucketed map join. The following conditions need to hold for bucketed map join to work 1) Both the tables are bucketed on the join columns 2) The number of buckets in each table should be multiples of each other 3) Ensure that the table has enough number of buckets Note: If the data is large say 1TB(per table) and if you have just a few buckets say 100 buckets, each mapper may have to load 10GB>. This would definitely blow your jvm . Bottom line is ensure your mappers are not heavily loaded with the bucketed data distribution. Regards Bejoy.K.S _____ From: binhnt22 <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Saturday, March 31, 2012 6:46 AM Subject: Why BucketJoinMap consume too much memory I have 2 table, each has 6 million records and clustered into 10 buckets These tables are very simple with 1 key column and 1 value column, all I want is getting the key that exists in both table but different value. The normal did the trick, took only 141 secs. select * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; I tried to use bucket join map by setting: set hive.optimize.bucketmapjoin = true select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; 2012-03-30 11:35:09 Starting to launch local task to process map join; maximum memory = 1398145024 2012-03-30 11:35:12 Processing rows: 200000 Hashtable size: 199999 Memory usage: 86646704 rate: 0.062 2012-03-30 11:35:15 Processing rows: 300000 Hashtable size: 299999 Memory usage: 128247464 rate: 0.092 2012-03-30 11:35:18 Processing rows: 400000 Hashtable size: 399999 Memory usage: 174041744 rate: 0.124 2012-03-30 11:35:21 Processing rows: 500000 Hashtable size: 499999 Memory usage: 214140840 rate: 0.153 2012-03-30 11:35:25 Processing rows: 600000 Hashtable size: 599999 Memory usage: 255181504 rate: 0.183 2012-03-30 11:35:29 Processing rows: 700000 Hashtable size: 699999 Memory usage: 296744320 rate: 0.212 2012-03-30 11:35:35 Processing rows: 800000 Hashtable size: 799999 Memory usage: 342538616 rate: 0.245 2012-03-30 11:35:38 Processing rows: 900000 Hashtable size: 899999 Memory usage: 384138552 rate: 0.275 2012-03-30 11:35:45 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 425719576 rate: 0.304 2012-03-30 11:35:50 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 467319576 rate: 0.334 2012-03-30 11:35:56 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 508940504 rate: 0.364 2012-03-30 11:36:04 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 550521128 rate: 0.394 2012-03-30 11:36:09 Processing r
-
Re: Why BucketJoinMap consume too much memoryNitin Pawar 2012-04-05, 10:36
can you try adding these settings
set hive.enforce.bucketing=true; hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; I have tried bucketing with 1000 buckets and with more than 1TB data tables .. they do go through fine On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <[EMAIL PROTECTED]> wrote: > Hi Bejoy,**** > > ** ** > > Both my tables has 65m records ( ~ 1.8-1.9GB on hadoop) and bucketized on > ‘calling’ column into 10 buckets.**** > > ** ** > > As you said, hive will load only 1 bucket ~ 180-190MB into memory. That’s > hardly to blow the heap (1.3GB)**** > > ** ** > > According to wiki, I set:**** > > ** ** > > set > hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;* > *** > > set hive.optimize.bucketmapjoin = true;**** > > set hive.optimize.bucketmapjoin.sortedmerge = true;**** > > ** ** > > And run the following SQL**** > > ** ** > > select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join > ra_ocs_cdr_ggsn_synthetic b **** > > on (a.calling = b.calling) where a.total_volume <> b.total_volume;**** > > ** ** > > But it still created many hash tables then threw Java Heap space error**** > > ** ** > > *Best regards* > > Nguyen Thanh Binh (Mr)**** > > Cell phone: (+84)98.226.0622**** > > ** ** > > *From:* Bejoy Ks [mailto:[EMAIL PROTECTED]] > *Sent:* Thursday, April 05, 2012 3:07 PM > *To:* [EMAIL PROTECTED] > > *Subject:* Re: Why BucketJoinMap consume too much memory**** > > ** ** > > Hi Amit**** > > ** ** > > Sorry for the delayed response, had a terrible schedule. AFAIK, > there is no flags that would help you to take the hash table creation, > compression and load into tmp files away from client node. **** > > From my understanding if you use a Map side join, the small table as > a whole is converted into a hash table and compressed in a tmp file. Say if > your child jvm size is 1gb and this small table is 5GB, it'd blow off jour > job if the map tasks tries to get such a huge file in memory. Bucketed map > join can help here, if the table is bucketed ,say 100 buckets then each > bucket may have around 50mb of data. ie one tmp file would be just less > that 50mb, here mapper needs to load only the required buckets > in memory and thus hardly run into memory issues.**** > > Also on the client, The records are processed bucket by bucket and > loaded into tmp files. So if your bucket size is too large, than the heap > size specified for your client, it'd throw an out of memory.**** > > ** ** > > Regards**** > > Bejoy KS**** > > ** ** > ------------------------------ > > *From:* Amit Sharma <[EMAIL PROTECTED]> > *To:* [EMAIL PROTECTED]; Bejoy Ks <[EMAIL PROTECTED]> > *Sent:* Tuesday, April 3, 2012 11:06 PM > *Subject:* Re: Why BucketJoinMap consume too much memory**** > > > > **** > > I am experiencing similar behavior in my queries. All the conditions for > bucketed map join are met, and the only difference in execution when i set > the hive.optimize.bucketmapjoin flag to true, is that instead of a single > hash table, multiple hash tables are created. All the Hash Tables are still > created on the client side and loaded into tmp files, which are then > distributed to the mappers using distributed cache. > > Can i find any example anywhere, which shows behavior of bucketed map > join, where in it does not create the has tables on the client itself? If > so, is there a flag for it? > > Thanks, > Amit**** > > On Sun, Apr 1, 2012 at 12:35 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:**** > > Hi > On a first look, it seems like map join is happening in your case > other than bucketed map join. The following conditions need to hold for > bucketed map join to work > 1) Both the tables are bucketed on the join columns > 2) The number of buckets in each table should be multiples of each other > 3) Ensure that the table has enough number of buckets > > Note: If the data is large say 1TB(per table) and if you have just a few > buckets say 100 buckets, each mapper may have to load 10GB>. This would Nitin Pawar
-
RE: Why BucketJoinMap consume too much memorybinhnt22 2012-04-05, 10:52
Thank Nitin,
I tried but no luck. Here's hive log, please spend a little time to view it. hive> set hive.optimize.bucketmapjoin = true; hive> set hive.enforce.bucketing=true; hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b > on (a.calling = b.calling) where a.total_volume <> b.total_volume; Total MapReduce jobs = 1 WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files. Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log 2012-04-05 10:37:45 Starting to launch local task to process map join; maximum memory = 1398145024 2012-04-05 10:37:48 Processing rows: 200000 Hashtable size: 199999 Memory usage: 75403880 rate: 0.054 2012-04-05 10:37:50 Processing rows: 300000 Hashtable size: 299999 Memory usage: 111404664 rate: 0.08 2012-04-05 10:37:54 Processing rows: 400000 Hashtable size: 399999 Memory usage: 151598960 rate: 0.108 2012-04-05 10:38:04 Processing rows: 500000 Hashtable size: 499999 Memory usage: 185483368 rate: 0.133 2012-04-05 10:38:09 Processing rows: 600000 Hashtable size: 599999 Memory usage: 221483392 rate: 0.158 2012-04-05 10:38:13 Processing rows: 700000 Hashtable size: 699999 Memory usage: 257482640 rate: 0.184 2012-04-05 10:38:19 Processing rows: 800000 Hashtable size: 799999 Memory usage: 297676944 rate: 0.213 2012-04-05 10:38:22 Processing rows: 900000 Hashtable size: 899999 Memory usage: 333676968 rate: 0.239 2012-04-05 10:38:27 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 369676944 rate: 0.264 2012-04-05 10:38:31 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 405676968 rate: 0.29 2012-04-05 10:38:36 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 441676944 rate: 0.316 2012-04-05 10:38:42 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 477676944 rate: 0.342 2012-04-05 10:38:47 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 513676968 rate: 0.367 2012-04-05 10:38:52 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 549676944 rate: 0.393 2012-04-05 10:39:00 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 602454200 rate: 0.431 2012-04-05 10:39:08 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 630065552 rate: 0.451 2012-04-05 10:39:14 Processing rows: 1800000 Hashtable size: 1799999 Memory usage: 666065552 rate: 0.476 2012-04-05 10:39:20 Processing rows: 1900000 Hashtable size: 1899999 Memory usage: 702065552 rate: 0.502 2012-04-05 10:39:26 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 738065576 rate: 0.528 2012-04-05 10:39:36 Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 774065552 rate: 0.554 2012-04-05 10:39:43 Processing rows: 2200000 Hashtable size: 2199999 Memory usage: 810065552 rate: 0.579 2012-04-05 10:39:51 Processing rows: 2300000 Hashtable size: 2299999 Memory usage: 846065576 rate: 0.605 2012-04-05 10:40:16 Processing rows: 2400000 Hashtable size: 2399999 Memory usage: 882085136 rate: 0.631 2012-04-05 10:40:24 Processing rows: 2500000 Hashtable size: 2499999 Memory usage: 918085208 rate: 0.657 2012-04-05 10:40:39 Processing rows: 2600000 Hashtable size: 2599999 Memory usage: 954065544 rate: 0.682 2012-04-05 10:40:48 Processing rows: 2700000 Hashtable size: 2699999 Memory usage: 990065568 rate: 0.708 2012-04-05 10:40:56 Processing rows: 2800000 Hashtable size: 2799999 Memory usage: 1026065552 rate: 0.734 2012-04-05 10:41:07 Processing rows: 2900000 Hashtable size: 2899999 Memory usage: 1062065576 rate: 0.76 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space Best regards Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 05, 2012 5:36 PM To: [EMAIL PROTECTED] Subject: Re: Why BucketJoinMap consume too much memory can you try adding these settings set hive.enforce.bucketing=true; hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; I have tried bucketing with 1000 buckets and with more than 1TB data tables .. they do go through fine On Thu, Apr 5, 2012 at 3:37 PM, binhnt22 <[EMAIL PROTECTED]> wrote: Hi Bejoy, Both my tables has 65m records ( ~ 1.8-1.9GB on hadoop) and bucketized on 'calling' column into 10 buckets. As you said, hive will load only 1 bucket ~ 180-190MB into memory. That's hardly to blow the heap (1.3GB) According to wiki, I set: set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; And run the following SQL select /*+ MAPJOIN(a) */ * from ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b on (a.calling = b.calling) where a.total_volume <> b.total_volume; But it still created many hash tables then threw Java Heap space error Best regards Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 05, 2012 3:07 PM To: [EMAIL PROTECTED] Subject: Re: Why BucketJoinMap consume too much memory Hi Amit Sorry for the delayed
-
Re: Why BucketJoinMap consume too much memoryNitin Pawar 2012-04-05, 11:33
Can you tell me the size of table b?
If you are doing bucketing and still size b table is huge then it will reach this problem On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <[EMAIL PROTECTED]> wrote: > Thank Nitin,**** > > ** ** > > I tried but no luck. Here’s hive log, please spend a little time to view > it.**** > > ** ** > > hive> set hive.optimize.bucketmapjoin = true;**** > > hive> set hive.enforce.bucketing=true;**** > > hive> set > hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;**** > > hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b**** > > > on (a.calling = b.calling) where a.total_volume <> b.total_volume;* > *** > > Total MapReduce jobs = 1**** > > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please > use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties > files.**** > > Execution log at: > /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log > **** > > 2012-04-05 10:37:45 Starting to launch local task to process map > join; maximum memory = 1398145024**** > > 2012-04-05 10:37:48 Processing rows: 200000 Hashtable size: > 199999 Memory usage: 75403880 rate: 0.054**** > > 2012-04-05 10:37:50 Processing rows: 300000 Hashtable size: > 299999 Memory usage: 111404664 rate: 0.08**** > > 2012-04-05 10:37:54 Processing rows: 400000 Hashtable size: > 399999 Memory usage: 151598960 rate: 0.108**** > > 2012-04-05 10:38:04 Processing rows: 500000 Hashtable size: > 499999 Memory usage: 185483368 rate: 0.133**** > > 2012-04-05 10:38:09 Processing rows: 600000 Hashtable size: > 599999 Memory usage: 221483392 rate: 0.158**** > > 2012-04-05 10:38:13 Processing rows: 700000 Hashtable size: > 699999 Memory usage: 257482640 rate: 0.184**** > > 2012-04-05 10:38:19 Processing rows: 800000 Hashtable size: > 799999 Memory usage: 297676944 rate: 0.213**** > > 2012-04-05 10:38:22 Processing rows: 900000 Hashtable size: > 899999 Memory usage: 333676968 rate: 0.239**** > > 2012-04-05 10:38:27 Processing rows: 1000000 Hashtable size: > 999999 Memory usage: 369676944 rate: 0.264**** > > 2012-04-05 10:38:31 Processing rows: 1100000 Hashtable size: > 1099999 Memory usage: 405676968 rate: 0.29**** > > 2012-04-05 10:38:36 Processing rows: 1200000 Hashtable size: > 1199999 Memory usage: 441676944 rate: 0.316**** > > 2012-04-05 10:38:42 Processing rows: 1300000 Hashtable size: > 1299999 Memory usage: 477676944 rate: 0.342**** > > 2012-04-05 10:38:47 Processing rows: 1400000 Hashtable size: > 1399999 Memory usage: 513676968 rate: 0.367**** > > 2012-04-05 10:38:52 Processing rows: 1500000 Hashtable size: > 1499999 Memory usage: 549676944 rate: 0.393**** > > 2012-04-05 10:39:00 Processing rows: 1600000 Hashtable size: > 1599999 Memory usage: 602454200 rate: 0.431**** > > 2012-04-05 10:39:08 Processing rows: 1700000 Hashtable size: > 1699999 Memory usage: 630065552 rate: 0.451**** > > 2012-04-05 10:39:14 Processing rows: 1800000 Hashtable size: > 1799999 Memory usage: 666065552 rate: 0.476**** > > 2012-04-05 10:39:20 Processing rows: 1900000 Hashtable size: > 1899999 Memory usage: 702065552 rate: 0.502**** > > 2012-04-05 10:39:26 Processing rows: 2000000 Hashtable size: > 1999999 Memory usage: 738065576 rate: 0.528**** > > 2012-04-05 10:39:36 Processing rows: 2100000 Hashtable size: > 2099999 Memory usage: 774065552 rate: 0.554**** > > 2012-04-05 10:39:43 Processing rows: 2200000 Hashtable size: > 2199999 Memory usage: 810065552 rate: 0.579**** > > 2012-04-05 10:39:51 Processing rows: 2300000 Hashtable size: Nitin Pawar
-
Re: Why BucketJoinMap consume too much memoryBejoy Ks 2012-04-05, 12:22
Hi Binh
I was just checking your local map join log , and I noticed two things - the memory usage by one hash table has got beyond 1G. - Number of rows processed is just 2M It is possible that, Each bucket it self is too large to be loaded in memory. As a work around or to nail down the bucket size is the issue here, can you try increasing the number of buckets to 100 and try doing a bucketed map join. Also you mentioned the data size is 2Gb, is it the compressed data size? 2012-04-05 10:41:07 Processing rows: 2,900,000 Hashtable size: 2899999 Memory usage: 1,062,065,576 rate: 0.76 Regards Bejoy KS ________________________________ From: Nitin Pawar <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Thursday, April 5, 2012 5:03 PM Subject: Re: Why BucketJoinMap consume too much memory Can you tell me the size of table b? If you are doing bucketing and still size b table is huge then it will reach this problem On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <[EMAIL PROTECTED]> wrote: Thank Nitin, > >I tried but no luck. Here’s hive log, please spend a little time to view it. > >hive> set hive.optimize.bucketmapjoin = true; >hive> set hive.enforce.bucketing=true; >hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; >hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b >�� > on (a.calling = b.calling) where a.total_volume <> b.total_volume; >Total MapReduce jobs = 1 >WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files. >Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log >2012-04-05 10:37:45 Starting to launch local task to process map join; maximum memory = 1398145024 >2012-04-05 10:37:48 Processing rows: 200000 Hashtable size: 199999 Memory usage: 75403880 rate: 0.054 >2012-04-05 10:37:50 Processing rows: 300000 Hashtable size: 299999 Memory usage:�� 111404664 rate: 0.08 >2012-04-05 10:37:54 Processing rows: 400000 Hashtable size: 399999�� Memory usage: 151598960 �� rate: 0.108 >2012-04-05 10:38:04 �� Processing rows: 500000 Hashtable size: 499999 Memory usage: 185483368 rate: 0.133 >2012-04-05 10:38:09 Processing rows: 600000 Hashtable size: 599999 Memory usage:�� 221483392 rate: 0.158 >2012-04-05 10:38:13 Processing rows:�� 700000 Hashtable size: 699999 Memory usage: 257482640 rate: 0.184 >2012-04-05 10:38:19 Processing rows: 800000�� Hashtable size: 799999 Memory usage: 297676944�� rate: 0.213 >2012-04-05 10:38:22 Processing rows: �� 900000 Hashtable size: 899999 Memory usage: 333676968 rate: �� 0.239 >2012-04-05 10:38:27 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 369676944 rate: 0.264 >2012-04-05 10:38:31 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 405676968 �� rate: 0.29 >2012-04-05 10:38:36 Processing rows: �� 1200000 Hashtable size: 1199999 Memory usage: 441676944 rate: 0.316 >2012-04-05 10:38:42 Processing rows: �� 1300000 Hashtable size: 1299999 Memory usage: 477676944 rate: 0.342 >2012-04-05 10:38:47 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 513676968 rate: 0.367 >2012-04-05 10:38:52 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 549676944 �� rate: 0.393 >2012-04-05 10:39:00 �� Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 602454200 rate: 0.431 >2012-04-05 10:39:08 Processing rows: �� 1700000 Hashtable size: 1699999 Memory usage: 630065552 rate: 0.451 >2012-04-05 10:39:14 Processing rows: 1800000 Hashtable size: 1799999 Memory usage:�� 666065552 rate: 0.476 10:39:20 Processing rows:�� 1900000 Hashtable size: 1899999 Memory usage: 702065552 rate: 0.502 10:39:26 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 738065576 �� rate: 0.528 10:39:36 �� Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 774065552 rate: 0.554 10:39:43 Processing rows: �� 2200000 Hashtable size: 2199999 Memory usage: 810065552 rate: 0.579 10:39:51 Processing rows: 2300000 Hashtable size: 2299999 Memory usage:�� 846065576 rate: 0.605 10:40:16 Processing rows:�� 2400000 Hashtable size: 2399999 Memory usage: 882085136 rate: 0.631 10:40:24 Processing rows: 2500000 Hashtable size: 2499999 Memory usage: 918085208 �� rate: 0.657 10:40:39 �� Processing rows: 2600000 Hashtable size: 2599999 Memory usage: 954065544 rate: 0.682 10:40:48 Processing rows: �� 2700000 Hashtable size: 2699999 Memory usage: 990065568 rate: 0.708 10:40:56 Processing rows: 2800000 Hashtable size: 2799999 Memory usage:�� 1026065552 rate: 0.734 10:41:07 Processing rows: �� 2900000 Hashtable size: 2899999 Memory usage: 1062065576 rate: �� 0.76 thread "Thread-1" java.lang.OutOfMemoryError: Java heap space [mailto:[EMAIL PROTECTED]] 1TB data tables .. they do go through fine 1.8-1.9GB on hadoop) and bucketized on ‘calling’ column into 10 buckets. bucket ~ 180-190MB into memory. That’s hardly to blow the heap (1.3GB) hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; hive.optimize.bucketmapjoin = true; hive.optimize.bucketmapjoin.sortedmerge = true; ra_md_cdr_ggsn_synthetic a join ra_ocs_cdr_ggsn_synthetic b a.total_volume <> b.total_volume; then threw Java Heap space error Thanh Binh (Mr) phone: (+84)98.226.0622 terrible schedule. AFAIK, there is no flags that would help you to take the hash table creatio
-
RE: Why BucketJoinMap consume too much memorybinhnt22 2012-04-06, 01:19
Hi Bejoy,
Sorry for late response. I will start to demonstrate over again to clear some information. I have 2 tables, nearly same. Both has the same table structure, 65m records, 2GB size (same size). hive> describe ra_md_syn; OK calling string total_duration bigint total_volume bigint total_charge bigint Both of them were bucketized into 256 buckets on ‘calling’ column (in the last time only 10 buckets, I tried to increase it as you suggested). And I want to find all ‘calling’ exists in both tables but different ‘total_volume’ The script as you knew: hive> set hive.optimize.bucketmapjoin = true; hive> set hive.enforce.bucketing=true; hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b > on (a.calling = b.calling) where a.total_volume <> b.total_volume; And the result was exactly in my last email. Java heap space error. With total size is only 2GB and 256 buckets, I think bucket size is impossible to be the issue here. Please give me some advice, I really appreciate Best regards Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 05, 2012 7:23 PM To: [EMAIL PROTECTED] Subject: Re: Why BucketJoinMap consume too much memory Hi Binh I was just checking your local map join log , and I noticed two things - the memory usage by one hash table has got beyond 1G. - Number of rows processed is just 2M It is possible that, Each bucket it self is too large to be loaded in memory. As a work around or to nail down the bucket size is the issue here, can you try increasing the number of buckets to 100 and try doing a bucketed map join. Also you mentioned the data size is 2Gb, is it the compressed data size? 2012-04-05 10:41:07 Processing rows: 2,900,000 Hashtable size: 2899999 Memory usage: 1,062,065,576 rate: 0.76 Regards Bejoy KS _____ From: Nitin Pawar <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Thursday, April 5, 2012 5:03 PM Subject: Re: Why BucketJoinMap consume too much memory Can you tell me the size of table b? If you are doing bucketing and still size b table is huge then it will reach this problem On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <[EMAIL PROTECTED]> wrote: Thank Nitin, I tried but no luck. Here’s hive log, please spend a little time to view it. hive> set hive.optimize.bucketmapjoin = true; hive> set hive.enforce.bucketing=true; hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b > on (a.calling = b.calling) where a.total_volume <> b.total_volume; Total MapReduce jobs = 1 WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files. Execution log at: /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log 2012-04-05 10:37:45 Starting to launch local task to process map join; maximum memory = 1398145024 2012-04-05 10:37:48 Processing rows: 200000 Hashtable size: 199999 Memory usage: 75403880 rate: 0.054 2012-04-05 10:37:50 Processing rows: 300000 Hashtable size: 299999 Memory usage: 111404664 rate: 0.08 2012-04-05 10:37:54 Processing rows: 400000 Hashtable size: 399999 Memory usage: 151598960 rate: 0.108 2012-04-05 10:38:04 Processing rows: 500000 Hashtable size: 499999 Memory usage: 185483368 rate: 0.133 2012-04-05 10:38:09 Processing rows: 600000 Hashtable size: 599999 Memory usage: 221483392 rate: 0.158 2012-04-05 10:38:13 Processing rows: 700000 Hashtable size: 699999 Memory usage: 257482640 rate: 0.184 2012-04-05 10:38:19 Processing rows: 800000 Hashtable size: 799999 Memory usage: 297676944 rate: 0.213 2012-04-05 10:38:22 Processing rows: 900000 Hashtable size: 899999 Memory usage: 333676968 rate: 0.239 2012-04-05 10:38:27 Processing rows: 1000000 Hashtable size: 999999 Memory usage: 369676944 rate: 0.264 2012-04-05 10:38:31 Processing rows: 1100000 Hashtable size: 1099999 Memory usage: 405676968 rate: 0.29 2012-04-05 10:38:36 Processing rows: 1200000 Hashtable size: 1199999 Memory usage: 441676944 rate: 0.316 2012-04-05 10:38:42 Processing rows: 1300000 Hashtable size: 1299999 Memory usage: 477676944 rate: 0.342 2012-04-05 10:38:47 Processing rows: 1400000 Hashtable size: 1399999 Memory usage: 513676968 rate: 0.367 2012-04-05 10:38:52 Processing rows: 1500000 Hashtable size: 1499999 Memory usage: 549676944 rate: 0.393 2012-04-05 10:39:00 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 602454200 rate: 0.431 2012-04-05 10:39:08 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 630065552 rate: 0.451 2012-04-05 10:39:14 Processing rows: 1800000 Hashtable size: 1799999 Memory usage: 666065552 rate: 0.476 2012-04-05 10:39:20 Processing rows: 1900000 Hashtable size: 1899999 Memory usage: 702065552 rate: 0.502 2012-04-05 10:39:26 Processing rows: 2000000 Hashtable size: 1999999 Memory usage: 738065576 rate: 0.528 2012-04-05 10:39:36 Processing rows: 2100000 Hashtable size: 2099999 Memory usage: 774065552 rate: 0.554 2012-04-05 10:39:43 Processing rows: 2200000 Hashtable size: 2199999 Memory usage: 810065552 rate: 0.579 2012-04-05 10:39:51 Processing rows: 2300000 Hashtable size: 2299999 Memory usage: 846065576 rate:
-
Re: Why BucketJoinMap consume too much memorygemini alex 2012-04-06, 07:06
I guess the problem is you can't using <> predicate in bucket join, try to
select c.* from ( select /*+ MAPJOIN(b) */ a.calling calling ,a. total_volume atotal_volume , b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b**** on (a.calling = b.calling) ) c where c.atotal_volumn<>c.btotal_volume ; 在 2012年4月6日 上午9:19,binhnt22 <[EMAIL PROTECTED]>写道: > Hi Bejoy,**** > > ** ** > > Sorry for late response. I will start to demonstrate over again to clear > some information.**** > > ** ** > > I have 2 tables, nearly same. Both has the same table structure, 65m > records, 2GB size (same size).**** > > hive> describe ra_md_syn;**** > > OK**** > > calling string**** > > total_duration bigint**** > > total_volume bigint**** > > total_charge bigint**** > > ** ** > > Both of them were bucketized into 256 buckets on ‘calling�� column (in the > last time only 10 buckets, I tried to increase it as you suggested). And I > want to find all ‘calling’ exists in both tables but different > ‘total_volume’**** > > The script as you knew:**** > > ** ** > > hive> set hive.optimize.bucketmapjoin = true;**** > > hive> set hive.enforce.bucketing=true;**** > > hive> set > hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;**** > > hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b**** > > > on (a.calling = b.calling) where a.total_volume <> b.total_volume;* > *** > > ** ** > > And the result was exactly in my last email. Java heap space error. With > total size is only 2GB and 256 buckets, I think bucket size is impossible > to be the issue here.**** > > ** ** > > Please give me some advice, I really appreciate**** > > *Best regards* > > Nguyen Thanh Binh (Mr)**** > > Cell phone: (+84)98.226.0622**** > > ** ** > > *From:* Bejoy Ks [mailto:[EMAIL PROTECTED]] > *Sent:* Thursday, April 05, 2012 7:23 PM > > *To:* [EMAIL PROTECTED] > *Subject:* Re: Why BucketJoinMap consume too much memory**** > > ** ** > > Hi Binh**** > > ** ** > > I was just checking your local map join log , and I noticed two things > **** > > - the memory usage by one hash table has got beyond 1G. **** > > - Number of rows processed is just 2M**** > > ** ** > > It is possible that, Each bucket it self is too large to be loaded > in memory.**** > > ** ** > > As a work around or to nail down the bucket size is the issue here, can > you try increasing the number of buckets to 100 and try doing a bucketed > map join.**** > > ** ** > > Also you mentioned the data size is 2Gb, is it the compressed data size?** > ** > > ** ** > > 2012-04-05 10:41:07 Processing rows: 2,900,000 Hashtable size: > 2899999 Memory usage: 1,062,065,576 rate: 0.76**** > > ** ** > > Regards**** > > Bejoy KS**** > > ** ** > > ** ** > > ** ** > ------------------------------ > > *From:* Nitin Pawar <[EMAIL PROTECTED]> > *To:* [EMAIL PROTECTED] > *Sent:* Thursday, April 5, 2012 5:03 PM > *Subject:* Re: Why BucketJoinMap consume too much memory**** > > > > **** > > Can you tell me the size of table b? **** > > ** ** > > If you are doing bucketing and still size b table is huge then it will > reach this problem**** > > On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <[EMAIL PROTECTED]> wrote:* > *** > > Thank Nitin,**** > > **** > > I tried but no luck. Here’s hive log, please spend a little time to view > it.**** > > **** > > hive> set hive.optimize.bucketmapjoin = true;**** > > hive> set hive.enforce.bucketing=true;**** > > hive> set > hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;**** > > hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b**** > > > on (a.calling = b.calling) where a.total_volume <> b.total_volume;* > *** > > Total MapReduce jobs = 1**** > > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please > use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties > files.**** > > Execution log at: > /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log
-
Re: Why BucketJoinMap consume too much memoryBejoy Ks 2012-04-06, 16:33
Hi Binh,
From the information you provided bucketed map join should be posible. I'm clueless now, but still I can make one more try, if you could provide me the output of the following 1) Describe Formatted ra_md_syn; 2) Describe Formatted ra_ocs_syn. 3) hadoop fs -du <hdfs location of ra_ocs_syn > 4) hadoop fs -du <hdfs location of ra_md_syn > 5) perform the join and paste the full console log along with the query. (with all the properties set at CLI) 6) your hive-site.xml @Alex You can use non equality conditions in the where clause. Only the ON conditions should be equality ones. Regards Bejoy KS ________________________________ From: gemini alex <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: Bejoy Ks <[EMAIL PROTECTED]> Sent: Friday, April 6, 2012 12:36 PM Subject: Re: Why BucketJoinMap consume too much memory I guess the problem is you can't using <> predicate in bucket join, try to select c.* from ( select /*+ MAPJOIN(b) */ a.calling calling ,a. total_volume atotal_volume , b.total_volume btotal_volume from ra_md_syn a join ra_ocs_syn b on (a.calling = b.calling) ) c where c.atotal_volumn<>c.btotal_volume ; 在 2012年4月6日 ��午9:19,binhnt22 <[EMAIL PROTECTED]>写��: Hi Bejoy, > >Sorry for late response. I will start to demonstrate over again to clear some information. > >I have 2 tables, nearly same. Both has the same table structure, 65m records, 2GB size (same size). >hive> describe ra_md_syn; >OK >calling string >total_duration bigint >total_volume �� bigint >total_charge bigint > >Both of them were bucketized into 256 buckets on ‘calling’ column (in the last time only 10 buckets, I tried to increase it as you suggested). And I want to find all ‘calling’ exists in both tables but different ‘total_volume’ >The script as you knew: > >hive> set hive.optimize.bucketmapjoin = true; >hive> set hive.enforce.bucketing=true; >hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; >hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b > > on (a.calling = b.calling) where a.total_volume <> b.total_volume; > >And the result was exactly in my last email. Java heap space error. With total size is only 2GB and 256 buckets, I think bucket size is impossible to be the issue here. > >Please give me some advice, I really appreciate >Best regards >Nguyen Thanh Binh (Mr) >Cell phone: (+84)98.226.0622 > >From:Bejoy Ks [mailto:[EMAIL PROTECTED]] >Sent: Thursday, April 05, 2012 7:23 PM > >To: [EMAIL PROTECTED] >Subject: Re: Why BucketJoinMap consume too much memory > >Hi Binh > > I was just checking your local map join log , and I noticed two things >- the memory usage by one hash table has got beyond 1G. >- Number of rows processed is just 2M > >It is possible that, Each bucket it self is too large to be loaded in memory. > >As a work around or to nail down the bucket size is the issue here, can you try increasing the number of buckets to 100 and try doing a bucketed map join. > >Also you mentioned the data size is 2Gb, is it the compressed data size? > >2012-04-05 10:41:07 Processing rows:�� 2,900,000 Hashtable size: 2899999 Memory usage: 1,062,065,576 rate: 0.76 > >Regards >Bejoy KS > > > > >________________________________ > >From:Nitin Pawar <[EMAIL PROTECTED]> >To: [EMAIL PROTECTED] >Sent: Thursday, April 5, 2012 5:03 PM >Subject: Re: Why BucketJoinMap consume too much memory > > > >Can you tell me the size of table b? > >If you are doing bucketing and still size b table is huge then it will reach this problem >On Thu, Apr 5, 2012 at 4:22 PM, binhnt22 <[EMAIL PROTECTED]> wrote: >Thank Nitin, > >I tried but no luck. Here’s hive log, please spend a little time to view it. > >hive> set hive.optimize.bucketmapjoin = true; >hive> set hive.enforce.bucketing=true; >hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; >hive> select /*+ MAPJOIN(b) */ * from ra_md_syn a join ra_ocs_syn b a.total_volume <> b.total_volume; deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files. /tmp/hduser/hduser_20120405103737_28ef26fe-a202-4047-b5ca-c40d9e3ad36c.log local task to process map join;�� maximum memory 1398145024 rows: 200000 Hashtable size: 199999 Memory usage: 75403880 rate:�� 0.054 rows: 300000 Hashtable size: 299999 Memory usage: 111404664 rate: 0.08 rows: �� 400000 Hashtable size: 399999 Memory usage: 151598960 rate: 0.108 rows: 500000 Hashtable size: 499999 Memory usage: 185483368 rate: 0.133 rows: 600000 Hashtable size: 599999 Memory usage: 221483392 rate: 0.158 rows: �� 700000 Hashtable size: 699999 Memory usage: 257482640 rate: 0.184 rows: 800000 Hashtable size: 799999 Memory usage: 297676944 rate: 0.213 rows: 900000�� Hashtable size: 899999 Memory usage: 333676968�� rate: 0.239 rows: �� 1000000 Hashtable size: 999999 Memory usage: 369676944 rate: 0.264 rows: 1100000 Hashtable size: 1099999 Memory usage: 405676968 rate: 0.29 rows: 1200000 Hashtable size: 1199999 Memory usage: 441676944 rate: 0.316 rows: 1300000 Hashtable size: 1299999 Memory usage: 477676944 rate: 0.342 rows: 1400000 Hashtable size: 1399999 Memory usage: 513676968 rate: 0.367 rows: �� 1500000 Hashtable size: 1499999 Memory usage: 549676944 rate: �� 0.393 rows: 1600000 Hashtable size: 1599999 Memory usage: 602454200 rate: 0.431 1700000 Hashtable size: 1699999 Memory usage: 630065552 �� rate: 0.451 rows: 1800000 Hashtable size: 1799999 Memory usage: 666065552 rate: 0.476 rows: �� 1900000 Hashtable size: 1899999 Memory usage: 702065552 rate: 0.502 rows: 2000000 Hashtable size: 1999999 Memory usage:
-
RE: Why BucketJoinMap consume too much memorybinhnt22 2012-04-09, 03:12
Hi Bejoy,
Thank you for helping me. Here is the information 1. Describe Formatted ra_md_syn; # col_name data_type comment calling string None total_duration bigint None total_volume bigint None total_charge bigint None # Detailed Table Information Database: default Owner: hduser CreateTime: Thu Apr 05 09:48:29 ICT 2012 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://master:54310/user/hive/warehouse/ra_md_syn Table Type: MANAGED_TABLE Table Parameters: numFiles 8 numPartitions 0 numRows 0 rawDataSize 0 totalSize 1872165483 transient_lastDdlTime 1333595095 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForm at Compressed: No Num Buckets: 256 Bucket Columns: [calling] Sort Columns: [] Storage Desc Params: serialization.format 1 2. Describe Formatted ra_ocs_syn; # col_name data_type comment calling string None total_duration bigint None total_volume bigint None total_charge bigint None # Detailed Table Information Database: default Owner: hduser CreateTime: Thu Apr 05 09:48:24 ICT 2012 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://master:54310/user/hive/warehouse/ra_ocs_syn Table Type: MANAGED_TABLE Table Parameters: numFiles 8 numPartitions 0 numRows 0 rawDataSize 0 totalSize 1872162225 transient_lastDdlTime 1333595512 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: 256 Bucket Columns: [calling] Sort Columns: [] Storage Desc Params: serialization.format 1 3. hadoop fs -du <hdfs location of ra_md_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn Found 8 items 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1 269949415 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0 262439067 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0 205767721 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0 24516526 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0 4. hadoop fs -du <hdfs location of ra_ocs_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn Found 8 items 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0 304959363 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0 274374381 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1 264694474 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0 257498693 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1 100334604 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0 41022170 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0 5. hive-site.xml [hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hive.metastore.local</name> <value>true</value> </property> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hadoop</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hadoop</value> </property> <property> <name>hive.metastore.sasl.enabled</name> <value>true</value> <description>If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description> </property> <property> <name>hive.metastore.kerberos.keytab.file</name> <value></value> <description>The path to the Kerberos Keytab file containing the metastore thrift server's service principal.</description> </property> <property> <name>hive.metastore.kerberos.principal</name> <value>hduser/[EMAIL PROTECTED]</value> <description>The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with the correct host name.</de
-
Re: Why BucketJoinMap consume too much memoryBejoy Ks 2012-04-09, 14:48
Hi Binh
It is just an issue with the number of buckets. Your tables has just 8 buckets, as there only 8 files are seen the storage directory. You might have just issued an ALTER TABLE script on an existing bucketed table. The work around here is 1) You need to wipe and reload the tables with hive.enforce.bucketing=true; �� Ensure your storage directory as that many files as the number of buckets. As per your table DDL you should see 256 files. 2) Enable hive.optimize.bucketmapjoin = true; and try doing the join again. It should definitely work. Regards Bejoy KS ________________________________ From: binhnt22 <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: 'Bejoy Ks' <[EMAIL PROTECTED]> Sent: Monday, April 9, 2012 8:42 AM Subject: RE: Why BucketJoinMap consume too much memory Hi Bejoy, Thank you for helping me. Here is the information 1. Describe Formatted ra_md_syn; # col_name �� data_type comment calling �� string �� None total_duration bigint�� None total_volume bigint None total_charge �� bigint �� None # Detailed Table Information Database: default Owner: �� hduser CreateTime: Thu Apr 05 09:48:29 ICT 2012 LastAccessTime:�� UNKNOWN Protect Mode: None Retention:�� 0 Location: hdfs://master:54310/user/hive/warehouse/ra_md_syn Table Type: MANAGED_TABLE Table Parameters: numFiles 8 �� numPartitions �� 0 numRows�� 0 rawDataSize 0 totalSize 1872165483 transient_lastDdlTime 1333595095 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForm �� at Compressed: No Num Buckets: �� 256 Bucket Columns: [calling] Sort Columns: [] Storage Desc Params: serialization.format �� 1 2. Describe Formatted ra_ocs_syn; # col_name �� data_type comment calling �� string �� None total_duration bigint�� None total_volume bigint None total_charge �� bigint �� None # Detailed Table Information Database: default Owner: �� hduser CreateTime: Thu Apr 05 09:48:24 ICT 2012 LastAccessTime:�� UNKNOWN Protect Mode: None Retention:�� 0 Location: hdfs://master:54310/user/hive/warehouse/ra_ocs_syn Table Type: �� MANAGED_TABLE Table Parameters: �� numFiles 8 �� numPartitions �� 0 numRows 0 rawDataSize 0 totalSize 1872162225 transient_lastDdlTime 1333595512 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: �� org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: �� No Num Buckets: 256 Bucket Columns: �� [calling] Sort Columns: �� [] Storage Desc Params: serialization.format 1 3. hadoop fs -du <hdfs location of ra_md_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn Found 8 items 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0 274374970�� hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1 269949415 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0 262439067 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0 205767721 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0 24516526 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0 4. �� hadoop fs -du <hdfs location of ra_ocs_syn��> [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn Found 8 items 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0 304959363 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0 274374381 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1 264694474 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0 257498693 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1 100334604 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0 41022170 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0 5. hive-site.xml [hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hive.metastore.local</name> �� <value>true</value> </property> <property> <name>javax.jdo.option.ConnectionURL</name> �� <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> �� <value>hadoop</value> �� </property> �� <property> <name>javax.jdo.option.ConnectionPassword</name> <value>hadoop</value> </property> <property> <name>hive.metastore.sasl.enabled</name> <value>true</value>
-
RE: Why BucketJoinMap consume too much memorybinhnt22 2012-04-10, 02:40
Hi Bejoy,
It worked like a charm. Thank you very much. I really really appreciate your help. This bucket join should be used with 1 big table and 1 small table. If both table are big, the join time would be much more than normal join. Best regards Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Monday, April 09, 2012 9:49 PM To: [EMAIL PROTECTED] Subject: Re: Why BucketJoinMap consume too much memory Hi Binh It is just an issue with the number of buckets. Your tables has just 8 buckets, as there only 8 files are seen the storage directory. You might have just issued an ALTER TABLE script on an existing bucketed table. The work around here is 1) You need to wipe and reload the tables with hive.enforce.bucketing=true; Ensure your storage directory as that many files as the number of buckets. As per your table DDL you should see 256 files. 2) Enable hive.optimize.bucketmapjoin = true; and try doing the join again. It should definitely work. Regards Bejoy KS _____ From: binhnt22 <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: 'Bejoy Ks' <[EMAIL PROTECTED]> Sent: Monday, April 9, 2012 8:42 AM Subject: RE: Why BucketJoinMap consume too much memory Hi Bejoy, Thank you for helping me. Here is the information 1. Describe Formatted ra_md_syn; # col_name data_type comment calling string None total_duration bigint None total_volume bigint None total_charge bigint None # Detailed Table Information Database: default Owner: hduser CreateTime: Thu Apr 05 09:48:29 ICT 2012 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://master:54310/user/hive/warehouse/ra_md_syn Table Type: MANAGED_TABLE Table Parameters: numFiles 8 numPartitions 0 numRows 0 rawDataSize 0 totalSize 1872165483 transient_lastDdlTime 1333595095 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForm at Compressed: No Num Buckets: 256 Bucket Columns: [calling] Sort Columns: [] Storage Desc Params: serialization.format 1 2. Describe Formatted ra_ocs_syn; # col_name data_type comment calling string None total_duration bigint None total_volume bigint None total_charge bigint None # Detailed Table Information Database: default Owner: hduser CreateTime: Thu Apr 05 09:48:24 ICT 2012 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://master:54310/user/hive/warehouse/ra_ocs_syn Table Type: MANAGED_TABLE Table Parameters: numFiles 8 numPartitions 0 numRows 0 rawDataSize 0 totalSize 1872162225 transient_lastDdlTime 1333595512 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: 256 Bucket Columns: [calling] Sort Columns: [] Storage Desc Params: serialization.format 1 3. hadoop fs -du <hdfs location of ra_md_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn Found 8 items 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1 269949415 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0 262439067 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0 205767721 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0 24516526 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0 4. hadoop fs -du <hdfs location of ra_ocs_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn Found 8 items 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0 304959363 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0 274374381 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1 264694474 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0 257498693 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000005_1 100334604 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000006_0 41022170 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000007_0 5. hive-site.xml [hduser@master ~]$ cat hive-0.8.1/conf/hive-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>hive.metastore.local</name> <value>true</value> </property> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://localhost:3306/hive?createD
-
Re: Why BucketJoinMap consume too much memoryBejoy Ks 2012-04-10, 15:43
Hi Binh
You are right, here both of your tables are of the same size. And loading 2GB od data into hash tables and then to temp files and so on would take some time. This time becomes negligible if it was like, one table was of 2GB and other of 2TB, then you'll notice the wide difference in performance between a common join and bucketed map join. If one of the table is too small map join would be good, if it is of moderate size then bucketed map join. Regards Bejoy KS ________________________________ From: binhnt22 <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: 'Bejoy Ks' <[EMAIL PROTECTED]> Sent: Tuesday, April 10, 2012 8:10 AM Subject: RE: Why BucketJoinMap consume too much memory Hi Bejoy, It worked like a charm. Thank you very much. I really really appreciate your help. This bucket join should be used with 1 big table and 1 small table. If both table are big, the join time would be much more than normal join. Best regards Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From:Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Monday, April 09, 2012 9:49 PM To: [EMAIL PROTECTED] Subject: Re: Why BucketJoinMap consume too much memory Hi Binh It is just an issue with the number of buckets. Your tables has just 8 buckets, as there only 8 files are seen the storage directory. You might have just issued an ALTER TABLE script on an existing bucketed table. The work around here is 1) You need to wipe and reload the tables with hive.enforce.bucketing=true; Ensure your storage directory as that many files as the number of buckets. As per your table DDL you should see 256 files. 2) Enable hive.optimize.bucketmapjoin = true; and try doing the join again. It should definitely work. Regards Bejoy KS ________________________________ From:binhnt22 <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: 'Bejoy Ks' <[EMAIL PROTECTED]> Sent: Monday, April 9, 2012 8:42 AM Subject: RE: Why BucketJoinMap consume too much memory Hi Bejoy, Thank you for helping me. Here is the information 1. Describe Formatted ra_md_syn; # col_name �� data_type �� comment calling string �� None total_duration bigint None total_volume bigint None total_charge �� bigint �� None # Detailed Table Information Database:�� default Owner: hduser CreateTime: Thu Apr 05 09:48:29 ICT 2012 LastAccessTime: UNKNOWN Protect Mode: �� None Retention: 0 Location: hdfs://master:54310/user/hive/warehouse/ra_md_syn Table Type: �� MANAGED_TABLE Table Parameters: �� numFiles 8 numPartitions �� 0 numRows�� 0 rawDataSize �� 0 totalSize �� 1872165483 transient_lastDdlTime 1333595095 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: �� org.apache.hadoop.mapred.TextInputFormat OutputFormat: �� org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForm �� at Compressed: �� No Num Buckets: 256 Bucket Columns: [calling] Sort Columns: �� [] Storage Desc Params: serialization.format 1 2. Describe Formatted ra_ocs_syn; # col_name �� data_type �� comment calling string �� None total_duration bigint None total_volume bigint None total_charge �� bigint �� None # Detailed Table Information Database: default Owner: hduser CreateTime: Thu Apr 05 09:48:24 ICT 2012 LastAccessTime: UNKNOWN Protect Mode: �� None Retention: 0 Location: hdfs://master:54310/user/hive/warehouse/ra_ocs_syn Table Type: �� MANAGED_TABLE Table Parameters: �� numFiles 8 �� numPartitions 0 numRows �� 0 rawDataSize 0 totalSize �� 1872162225 transient_lastDdlTime 1333595512 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: �� org.apache.hadoop.mapred.TextInputFormat OutputFormat: �� org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: �� 256 Bucket Columns: [calling] Sort Columns: �� [] Storage Desc Params: serialization.format �� 1 3. hadoop fs -du <hdfs location of ra_md_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_md_syn Found 8 items 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000000_1 280371407 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000001_0 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000002_1 274374970 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000003_1 269949415 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000004_0 262439067 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000005_0 205767721 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000006_0 24516526 hdfs://master:54310/user/hive/warehouse/ra_md_syn/000007_0 4. hadoop fs -du <hdfs location of ra_ocs_syn > [hduser@master hadoop-0.20.203.0]$ bin/hadoop fs -du /user/hive/warehouse/ra_ocs_syn Found 8 items 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000000_0 314639270 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000001_0 304959363 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000002_0 274374381 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000003_1 264694474 hdfs://master:54310/user/hive/warehouse/ra_ocs_syn/000004_0 257 |