|
Ruben de Vries
2012-04-23, 15:19
Tucker, Matt
2012-04-23, 15:30
Bejoy KS
2012-04-23, 15:31
Ruben de Vries
2012-04-23, 15:38
Bejoy KS
2012-04-23, 16:03
Bejoy KS
2012-04-23, 16:05
Edward Capriolo
2012-04-23, 16:09
Ruben de Vries
2012-04-23, 16:13
Bejoy KS
2012-04-23, 16:39
Mark Grover
2012-04-24, 01:16
Ruben de Vries
2012-04-24, 06:58
Nitin Pawar
2012-04-24, 07:16
Bejoy Ks
2012-04-24, 07:46
Ruben de Vries
2012-04-24, 08:36
Nitin Pawar
2012-04-24, 09:45
Ruben de Vries
2012-04-24, 11:28
Bejoy Ks
2012-04-24, 13:57
Ruben de Vries
2012-04-24, 16:09
gemini alex
2012-04-25, 07:36
gemini alex
2012-04-25, 07:40
Ruben de Vries
2012-04-25, 07:48
Mark Grover
2012-04-26, 00:59
gemini alex
2012-04-26, 03:37
Ruben de Vries
2012-04-26, 07:16
Ruben de Vries
2012-04-26, 09:06
|
-
When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-23, 15:19
It seems there's enough information to be found on how to setup and use partitions and buckets.
But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
RE: When/how to use partitions and buckets usefully?Tucker, Matt 2012-04-23, 15:30
If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution.
If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED] Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
Re: When/how to use partitions and buckets usefully?Bejoy KS 2012-04-23, 15:31
Partitions are good when you want to run your queries on a subset of whole data. So the partition column depends on your queries. But a good point to be taken care is that every partition have enough data.
Partition gets into effect when you use filters with Where clause. Buckets are good for sampling and joins like bucketed map joins. Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: Ruben de Vries <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 17:19:00 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-23, 15:38
Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;)
The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
Re: When/how to use partitions and buckets usefully?Bejoy KS 2012-04-23, 16:03
Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years?
The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: Ruben de Vries <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 17:38:20 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
Re: When/how to use partitions and buckets usefully?Bejoy KS 2012-04-23, 16:05
For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other.
Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: "Bejoy KS" <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 16:03:34 To: <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: Ruben de Vries <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 17:38:20 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
Re: When/how to use partitions and buckets usefully?Edward Capriolo 2012-04-23, 16:09
There are many good reasons to use bucketing. I have these rules for
when to dump partitioning in favor of bucketing: 1) too many partitions a day (500 + partitions a day, # of file issues) 2) an unpredictable number of partitions per day. You can weight these factors with other benefits buckets offer. Edward On Mon, Apr 23, 2012 at 12:05 PM, Bejoy KS <[EMAIL PROTECTED]> wrote: > For Bucketed map join, both tables should be bucketed and the number of > buckets of one should be multiple of other. > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > ________________________________ > From: "Bejoy KS" <[EMAIL PROTECTED]> > Date: Mon, 23 Apr 2012 16:03:34 +0000 > To: <[EMAIL PROTECTED]> > ReplyTo: [EMAIL PROTECTED] > Subject: Re: When/how to use partitions and buckets usefully? > > Bucketed map join would be good I guess. What is the total size of the > smaller table and what is its expected size in the next few years? > > The size should be good enough to be put in Distributed Cache, then map side > joins would offer you much performance improvement. > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > ________________________________ > From: Ruben de Vries <[EMAIL PROTECTED]> > Date: Mon, 23 Apr 2012 17:38:20 +0200 > To: [EMAIL PROTECTED]<[EMAIL PROTECTED]> > ReplyTo: [EMAIL PROTECTED] > Subject: RE: When/how to use partitions and buckets usefully? > > Ok, very clear on the partitions, try to make them match the WHERE clauses, > not so much about group clauses then ;) > > > > The member_map contains 11.636.619 records atm, I think bucketing those > would be good? > > What’s a good number to bucket them by then? > > > > And is there any point in bucketing the visit_stats? > > > > From: Tucker, Matt [mailto:[EMAIL PROTECTED]] > Sent: Monday, April 23, 2012 5:30 PM > To: [EMAIL PROTECTED] > Subject: RE: When/how to use partitions and buckets usefully? > > > > If you’re only interested in a certain window of dates for analysis, a > date-based partition scheme will be helpful, as it will trim partitions that > aren’t needed by the query before execution. > > > > If the member_map table is small, you might consider testing the feasibility > of map-side joins, as it will reduce the number of processing stages. If > member_map is large, bucketing on member_id will avoid having as many rows > from visit_stats compared to each member_id for joins. > > > > Matt Tucker > > > > From: Ruben de Vries [mailto:[EMAIL PROTECTED]] > Sent: Monday, April 23, 2012 11:19 AM > To: [EMAIL PROTECTED] > Subject: When/how to use partitions and buckets usefully? > > > > It seems there’s enough information to be found on how to setup and use > partitions and buckets. > > But I’m more interested in how to figure out when and what columns you > should be partitioning and bucketing to increase performance?! > > > > In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols > which give me info about the visits) and 1 member_map (member_id, gender, > age). > > > > Usually I group by date and then one of the other col so I assume that > partitioning on date is a good start?! > > > > It seems the join of the member_map onto the visit_stats makes the queries a > lot slower, can that be fixed by bucketing both tables? Or just one of them? > > > > Maybe some ppl have written good blogs on this subject but I can’t really > seem to find them!? > > > > Any help would be appreciated, thanks in advance J
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-23, 16:13
Thanks for the help so far guys,
I bucketed the members_map, it's 330mb in size (11 mil records). Can you manually bucket stuff? Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA to import stuff into the visit_stats tables, replacing that with INSERT OVERWRITE SELECT slows it down a lot From: Bejoy KS [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 6:06 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other. Regards Bejoy KS Sent from handheld, please excuse typos. ________________________________ From: "Bejoy KS" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Date: Mon, 23 Apr 2012 16:03:34 +0000 To: <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> ReplyTo: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: Re: When/how to use partitions and buckets usefully? Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Date: Mon, 23 Apr 2012 17:38:20 +0200 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]%[EMAIL PROTECTED]>> ReplyTo: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
Re: When/how to use partitions and buckets usefully?Bejoy KS 2012-04-23, 16:39
If data is in hdfs, then you can bucket it only after loading into a temp/staging table and then to the final bucketed table. Bucketing needs a Map reduce job.
Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: Ruben de Vries <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 18:13:20 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>; [EMAIL PROTECTED]<[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? Thanks for the help so far guys, I bucketed the members_map, it's 330mb in size (11 mil records). Can you manually bucket stuff? Since my initial mapreduce job is still outside of Hive I'm doing a LOAD DATA to import stuff into the visit_stats tables, replacing that with INSERT OVERWRITE SELECT slows it down a lot From: Bejoy KS [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 6:06 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other. Regards Bejoy KS Sent from handheld, please excuse typos. ________________________________ From: "Bejoy KS" <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Date: Mon, 23 Apr 2012 16:03:34 +0000 To: <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> ReplyTo: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: Re: When/how to use partitions and buckets usefully? Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> Date: Mon, 23 Apr 2012 17:38:20 +0200 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]%[EMAIL PROTECTED]>> ReplyTo: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]]<mailto:[mailto:[EMAIL PROTECTED]]> Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]> Subject: When/how to use partitions and buckets usefully? It seems there's enough information to be found on how to setup and use partitions and buckets. But I'm more interested in how to figure out when and what columns you should be partitioning and bucketing to increase performance?! In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols which give me info about the visits) and 1 member_map (member_id, gender, age). Usually I group by date and then one of the other col so I assume that partitioning on date is a good start?! It seems the join of the member_map onto the visit_stats makes the queries a lot slower, can that be fixed by bucketing both tables? Or just one of them? Maybe some ppl have written good blogs on this subject but I can't really seem to find them!? Any help would be appreciated, thanks in advance :)
-
Re: When/how to use partitions and buckets usefully?Mark Grover 2012-04-24, 01:16
Hi Ruben,
Like Bejoy pointed out, members_map is small enough to fit in memory, so your joins with visit_stats would be much faster with map-side join. However, there is still some virtue in bucketing visit_stats. Bucketing can optimize joins, group by's and potentially other queries in certain circumstances. You probably want to keep consistent bucketing columns across all your tables so they can leveraged in multi-table queries. Most people use some power of 2 as their number of buckets. To make the best use of the buckets, each of your buckets should be able to entirely load into memory on the node. I use something close the formula below to calculate the number of buckets: #buckets = (x * Average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode I call x (>1) the "factor of conservatism". Higher x means you are being more conservative by having larger number of buckets (and bearing the increased overhead), lower x means the reverse. What x to use would depend on your use case. This is because the number of buckets in a table is fixed. If you have a large partition, it would distribute it's data into bulkier buckets and you would want to make sure these bulkier buckets can still fit in memory. Moreover, buckets are generated using a hashing function, if you have a strong bias towards a particular value of bucketing column in your data, some buckets might be bulkier than others. In that case, you'd want to make sure that those bulkier buckets can still fit in memory. To summarize, it depends on: * How the actual partition sizes vary from the average partition size (i.e. the standard deviation of your partition size). More standard deviations means you should be more conservative in your calculation and vice-versa. * Distribution of the data in the bucketing columns. "Wider" distribution means you should be more conservative and vice-versa. Long story short, I would say, x of 2 to 4 should suffice in most cases but feel free to verify that in your case:-) I would love to hear what factors others have been using when calculating their number of buckets, BTW! Whatever answer you get for #buckets from above formula, use the closest power of 2 as the number of buckets in your table (I am not sure if this is a must, though). Good luck! Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Bejoy KS" <[EMAIL PROTECTED]> To: "Ruben de Vries" <[EMAIL PROTECTED]>, [EMAIL PROTECTED] Sent: Monday, April 23, 2012 12:39:17 PM Subject: Re: When/how to use partitions and buckets usefully? If data is in hdfs, then you can bucket it only after loading into a temp/staging table and then to the final bucketed table. Bucketing needs a Map reduce job. Regards Bejoy KS Sent from handheld, please excuse typos. From: Ruben de Vries <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 18:13:20 +0200 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>; [EMAIL PROTECTED]<[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? Thanks for the help so far guys, I bucketed the members_map, it’s 330mb in size (11 mil records). Can you manually bucket stuff? Since my initial mapreduce job is still outside of Hive I’m doing a LOAD DATA to import stuff into the visit_stats tables, replacing that with INSERT OVERWRITE SELECT slows it down a lot From: Bejoy KS [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 6:06 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other. Regards Bejoy KS Sent from handheld, please excuse typos. From: "Bejoy KS" < [EMAIL PROTECTED] > Date: Mon, 23 Apr 2012 16:03:34 +0000 To: < [EMAIL PROTECTED] > ReplyTo: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. From: Ruben de Vries < [EMAIL PROTECTED] > Date: Mon, 23 Apr 2012 17:38:20 +0200 To: [EMAIL PROTECTED]<[EMAIL PROTECTED] > ReplyTo: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What’s a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? If you’re only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren’t needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_stats compared to each member_id for joins. Matt Tucker From: Ruben de Vries [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 11:19 AM To: [EMAIL PROTECTED] Subject: When/how to use partitions and buckets usefully? It seems there’s enough information to be found on how to setup and use partitions
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-24, 06:58
Wow thanks everyone for the nice feedback!
I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right? Cheers, Ruben de Vries -----Original Message----- From: Mark Grover [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 3:17 AM To: [EMAIL PROTECTED]; bejoy ks Cc: Ruben de Vries Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben, Like Bejoy pointed out, members_map is small enough to fit in memory, so your joins with visit_stats would be much faster with map-side join. However, there is still some virtue in bucketing visit_stats. Bucketing can optimize joins, group by's and potentially other queries in certain circumstances. You probably want to keep consistent bucketing columns across all your tables so they can leveraged in multi-table queries. Most people use some power of 2 as their number of buckets. To make the best use of the buckets, each of your buckets should be able to entirely load into memory on the node. I use something close the formula below to calculate the number of buckets: #buckets = (x * Average_partition_size) / JVM_memory_available_to_your_Hadoop_tasknode I call x (>1) the "factor of conservatism". Higher x means you are being more conservative by having larger number of buckets (and bearing the increased overhead), lower x means the reverse. What x to use would depend on your use case. This is because the number of buckets in a table is fixed. If you have a large partition, it would distribute it's data into bulkier buckets and you would want to make sure these bulkier buckets can still fit in memory. Moreover, buckets are generated using a hashing function, if you have a strong bias towards a particular value of bucketing column in your data, some buckets might be bulkier than others. In that case, you'd want to make sure that those bulkier buckets can still fit in memory. To summarize, it depends on: * How the actual partition sizes vary from the average partition size (i.e. the standard deviation of your partition size). More standard deviations means you should be more conservative in your calculation and vice-versa. * Distribution of the data in the bucketing columns. "Wider" distribution means you should be more conservative and vice-versa. Long story short, I would say, x of 2 to 4 should suffice in most cases but feel free to verify that in your case:-) I would love to hear what factors others have been using when calculating their number of buckets, BTW! Whatever answer you get for #buckets from above formula, use the closest power of 2 as the number of buckets in your table (I am not sure if this is a must, though). Good luck! Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Bejoy KS" <[EMAIL PROTECTED]> To: "Ruben de Vries" <[EMAIL PROTECTED]>, [EMAIL PROTECTED] Sent: Monday, April 23, 2012 12:39:17 PM Subject: Re: When/how to use partitions and buckets usefully? If data is in hdfs, then you can bucket it only after loading into a temp/staging table and then to the final bucketed table. Bucketing needs a Map reduce job. Regards Bejoy KS Sent from handheld, please excuse typos. From: Ruben de Vries <[EMAIL PROTECTED]> Date: Mon, 23 Apr 2012 18:13:20 +0200 To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>; [EMAIL PROTECTED]<[EMAIL PROTECTED]> Subject: RE: When/how to use partitions and buckets usefully? Thanks for the help so far guys, I bucketed the members_map, it’s 330mb in size (11 mil records). Can you manually bucket stuff? Since my initial mapreduce job is still outside of Hive I’m doing a LOAD DATA to import stuff into the visit_stats tables, replacing that with INSERT OVERWRITE SELECT slows it down a lot From: Bejoy KS [mailto:[EMAIL PROTECTED]] SentTo: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? For Bucketed map join, both tables should be bucketed and the number of buckets of one should be multiple of other. Regards Bejoy KS Sent from handheld, please excuse typos. From: "Bejoy KS" < [EMAIL PROTECTED] > Date: Mon, 23 Apr 2012 16:03:34 +0000 To: < [EMAIL PROTECTED] > ReplyTo: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Bucketed map join would be good I guess. What is the total size of the smaller table and what is its expected size in the next few years? The size should be good enough to be put in Distributed Cache, then map side joins would offer you much performance improvement. Regards Bejoy KS Sent from handheld, please excuse typos. From: Ruben de Vries < [EMAIL PROTECTED] > Date: Mon, 23 Apr 2012 17:38:20 +0200 To: [EMAIL PROTECTED]<[EMAIL PROTECTED] > ReplyTo: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? Ok, very clear on the partitions, try to make them match the WHERE clauses, not so much about group clauses then ;) The member_map contains 11.636.619 records atm, I think bucketing those would be good? What’s a good number to bucket them by then? And is there any point in bucketing the visit_stats? From: Tucker, Matt [mailto:[EMAIL PROTECTED]] Sent: Monday, April 23, 2012 5:30 PM To: [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? If you’re only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren’t needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it will reduce the number of processing stages. If member_map is large, bucketing on member_id will avoid having as many rows from visit_st
-
Re: When/how to use partitions and buckets usefully?Nitin Pawar 2012-04-24, 07:16
If you are doing a map side join make sure the table members_map is
small enough to hold in memory On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: > Wow thanks everyone for the nice feedback! > > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right? > > > Cheers, > > Ruben de Vries > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 3:17 AM > To: [EMAIL PROTECTED]; bejoy ks > Cc: Ruben de Vries > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben, > Like Bejoy pointed out, members_map is small enough to fit in memory, so > your joins with visit_stats would be much faster with map-side join. > > However, there is still some virtue in bucketing visit_stats. Bucketing can > optimize joins, group by's and potentially other queries in certain > circumstances. > You probably want to keep consistent bucketing columns across all your > tables so they can leveraged in multi-table queries. Most people use some > power of 2 as their number of buckets. To make the best use of the buckets, > each of your buckets should be able to entirely load into memory on the > node. > > I use something close the formula below to calculate the number of buckets: > > #buckets = (x * Average_partition_size) / > JVM_memory_available_to_your_Hadoop_tasknode > > I call x (>1) the "factor of conservatism". Higher x means you are being > more conservative by having larger number of buckets (and bearing the > increased overhead), lower x means the reverse. What x to use would depend > on your use case. This is because the number of buckets in a table is fixed. > If you have a large partition, it would distribute it's data into bulkier > buckets and you would want to make sure these bulkier buckets can still fit > in memory. Moreover, buckets are generated using a hashing function, if you > have a strong bias towards a particular value of bucketing column in your > data, some buckets might be bulkier than others. In that case, you'd want to > make sure that those bulkier buckets can still fit in memory. > > To summarize, it depends on: > * How the actual partition sizes vary from the average partition size (i.e. > the standard deviation of your partition size). More standard deviations > means you should be more conservative in your calculation and vice-versa. > * Distribution of the data in the bucketing columns. "Wider" distribution > means you should be more conservative and vice-versa. > > Long story short, I would say, x of 2 to 4 should suffice in most cases but > feel free to verify that in your case:-) I would love to hear what factors > others have been using when calculating their number of buckets, BTW! > Whatever answer you get for #buckets from above formula, use the closest > power of 2 as the number of buckets in your table (I am not sure if this is > a must, though). > > Good luck! > > Mark > > Mark Grover, Business Intelligence Analyst OANDA Corporation > > www: oanda.com www: fxtrade.com > e: [EMAIL PROTECTED] > > "Best Trading Platform" - World Finance's Forex Awards 2009. > "The One to Watch" - Treasury Today's Adam Smith Awards 2009. > > > ----- Original Message ----- > From: "Bejoy KS" <[EMAIL PROTECTED]> > To: "Ruben de Vries" <[EMAIL PROTECTED]>, [EMAIL PROTECTED] > Sent: Monday, April 23, 2012 12:39:17 PM > Subject: Re: When/how to use partitions and buckets usefully? > > If data is in hdfs, then you can bucket it only after loading into a > temp/staging table and then to the final bucketed table. Bucketing needs a > Map reduce job. > > > Regards > Bejoy KS > > Sent from handheld, please excuse typos. > > From: Ruben de Vries <[EMAIL PROTECTED]> > Date: Mon, 23 Apr 2012 18:13:20 +0200 > To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>; > [EMAIL PROTECTED]<[EMAIL PROTECTED]> > Subject: RE: When/how to use partitions and buckets usefully? > > > > > Thanks for the help so far guys, > > > > I bucketed the members_map, it’s 330mb in size (11 mil records). Nitin Pawar
-
Re: When/how to use partitions and buckets usefully?Bejoy Ks 2012-04-24, 07:46
Hi Ruben
Map join hint is provided to hive using "MAPJOIN" keyword as : SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key To use map side join some hive configuration properties needs to be enabled For plain map side joins hive>SET hive.auto.convert.join=true; Latest versions of hive does a map join on the smaller table even if no map join hit is provided. For bucketed map joins hive>SET hive.optimize.bucketmapjoin=true https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Regards Bejoy ________________________________ From: Nitin Pawar <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, April 24, 2012 12:46 PM Subject: Re: When/how to use partitions and buckets usefully? If you are doing a map side join make sure the table members_map is small enough to hold in memory On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: > Wow thanks everyone for the nice feedback! > > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right? > > > Cheers, > > Ruben de Vries > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 3:17 AM > To: [EMAIL PROTECTED]; bejoy ks > Cc: Ruben de Vries > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben, > Like Bejoy pointed out, members_map is small enough to fit in memory, so > your joins with visit_stats would be much faster with map-side join. > > However, there is still some virtue in bucketing visit_stats. Bucketing can > optimize joins, group by's and potentially other queries in certain > circumstances. > You probably want to keep consistent bucketing columns across all your > tables so they can leveraged in multi-table queries. Most people use some > power of 2 as their number of buckets. To make the best use of the buckets, > each of your buckets should be able to entirely load into memory on the > node. > > I use something close the formula below to calculate the number of buckets: > > #buckets = (x * Average_partition_size) / > JVM_memory_available_to_your_Hadoop_tasknode > > I call x (>1) the "factor of conservatism". Higher x means you are being > more conservative by having larger number of buckets (and bearing the > increased overhead), lower x means the reverse. What x to use would depend > on your use case. This is because the number of buckets in a table is fixed. > If you have a large partition, it would distribute it's data into bulkier > buckets and you would want to make sure these bulkier buckets can still fit > in memory. Moreover, buckets are generated using a hashing function, if you > have a strong bias towards a particular value of bucketing column in your > data, some buckets might be bulkier than others. In that case, you'd want to > make sure that those bulkier buckets can still fit in memory. > > To summarize, it depends on: > * How the actual partition sizes vary from the average partition size (i.e. > the standard deviation of your partition size). More standard deviations > means you should be more conservative in your calculation and vice-versa. > * Distribution of the data in the bucketing columns. "Wider" distribution > means you should be more conservative and vice-versa. > > Long story short, I would say, x of 2 to 4 should suffice in most cases but > feel free to verify that in your case:-) I would love to hear what factors > others have been using when calculating their number of buckets, BTW! > Whatever answer you get for #buckets from above formula, use the closest > power of 2 as the number of buckets in your table (I am not sure if this is > a must, though). > > Good luck! > > Mark > > Mark Grover, Business Intelligence Analyst OANDA Corporation > > www: oanda.com www: fxtrade.com > e: [EMAIL PROTECTED] > > "Best Trading Platform" - World Finance's Forex Awards 2009. > "The One to Watch" - Treasury Today's Adam Smith Awards 2009. > > > ----- Original Message ----- > From: "Bejoy KS" <[EMAIL PROTECTED]> Nitin Pawar
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-24, 08:36
Hmm I must be doing something wrong, the members_map table is 300ish MB.
When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999 Memory usage: 43501848 rate: 0.023 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: 6899999 Memory usage: 1449867552 rate: 0.778 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1468378760 rate: 0.788 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space I'm running it only my local, single node, dev env, could that be a problem since it won't split over multiple mappers in this case? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 9:47 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben Map join hint is provided to hive using "MAPJOIN" keyword as : SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key To use map side join some hive configuration properties needs to be enabled For plain map side joins hive>SET hive.auto.convert.join=true; Latest versions of hive does a map join on the smaller table even if no map join hit is provided. For bucketed map joins hive>SET hive.optimize.bucketmapjoin=true https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Regards Bejoy ________________________________ From: Nitin Pawar <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, April 24, 2012 12:46 PM Subject: Re: When/how to use partitions and buckets usefully? If you are doing a map side join make sure the table members_map is small enough to hold in memory On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: > Wow thanks everyone for the nice feedback! > > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right? > > > Cheers, > > Ruben de Vries > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 3:17 AM > To: [EMAIL PROTECTED]; bejoy ks > Cc: Ruben de Vries > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben, > Like Bejoy pointed out, members_map is small enough to fit in memory, > so your joins with visit_stats would be much faster with map-side join. > > However, there is still some virtue in bucketing visit_stats. > Bucketing can optimize joins, group by's and potentially other queries > in certain circumstances. > You probably want to keep consistent bucketing columns across all your > tables so they can leveraged in multi-table queries. Most people use > some power of 2 as their number of buckets. To make the best use of > the buckets, each of your buckets should be able to entirely load into > memory on the node. > > I use something close the formula below to calculate the number of buckets: > > #buckets = (x * Average_partition_size) / > JVM_memory_available_to_your_Hadoop_tasknode > > I call x (>1) the "factor of conservatism". Higher x means you are > being more conservative by having larger number of buckets (and > bearing the increased overhead), lower x means the reverse. What x to > use would depend on your use case. This is because the number of buckets in a table is fixed. > If you have a large partition, it would distribute it's data into > bulkier buckets and you would want to make sure these bulkier buckets > can still fit in memory. Moreover, buckets are generated using a > hashing function, if you have a strong bias towards a particular value > of bucketing column in your data, some buckets might be bulkier than Nitin Pawar
-
Re: When/how to use partitions and buckets usefully?Nitin Pawar 2012-04-24, 09:45
This operation is erroring out on the hive client itself before starting a
map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]>wrote: > Hmm I must be doing something wrong, the members_map table is 300ish MB. > When I execute the following query: > > SELECT > /*+ MAPJOIN(members_map) */ > date_int, > members_map.gender AS gender, > 'generic', > COUNT( memberId ) AS unique, > SUM( `generic`['count'] ) AS count, > SUM( `generic`['seconds'] ) AS seconds > FROM visit_stats > JOIN members_map ON(members_map.member_id = visit_stats.memberId) > GROUP BY date_int, members_map.gender > > It results in: > 2012-04-24 10:25:59 Starting to launch local task to process map join; > maximum memory = 1864171520 > 2012-04-24 10:26:00 Processing rows: 200000 Hashtable > size: 199999 Memory usage: 43501848 rate: 0.023 > 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: > 6899999 Memory usage: 1449867552 rate: 0.778 > 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: > 6999999 Memory usage: 1468378760 rate: 0.788 > Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space > > > I'm running it only my local, single node, dev env, could that be a > problem since it won't split over multiple mappers in this case? > > > -----Original Message----- > From: Bejoy Ks [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 9:47 AM > To: [EMAIL PROTECTED] > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben > Map join hint is provided to hive using "MAPJOIN" keyword as : > SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key > > To use map side join some hive configuration properties needs to be enabled > > For plain map side joins > hive>SET hive.auto.convert.join=true; > Latest versions of hive does a map join on the smaller table even if no > map join hit is provided. > > For bucketed map joins > hive>SET hive.optimize.bucketmapjoin=true > > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins > > > Regards > Bejoy > > > ________________________________ > From: Nitin Pawar <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Tuesday, April 24, 2012 12:46 PM > Subject: Re: When/how to use partitions and buckets usefully? > > If you are doing a map side join make sure the table members_map is small > enough to hold in memory > > On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: > > Wow thanks everyone for the nice feedback! > > > > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ > right? > > > > > > Cheers, > > > > Ruben de Vries > > > > -----Original Message----- > > From: Mark Grover [mailto:[EMAIL PROTECTED]] > > Sent: Tuesday, April 24, 2012 3:17 AM > > To: [EMAIL PROTECTED]; bejoy ks > > Cc: Ruben de Vries > > Subject: Re: When/how to use partitions and buckets usefully? > > > > Hi Ruben, > > Like Bejoy pointed out, members_map is small enough to fit in memory, > > so your joins with visit_stats would be much faster with map-side join. > > > > However, there is still some virtue in bucketing visit_stats. > > Bucketing can optimize joins, group by's and potentially other queries > > in certain circumstances. > > You probably want to keep consistent bucketing columns across all your > > tables so they can leveraged in multi-table queries. Most people use > > some power of 2 as their number of buckets. To make the best use of > > the buckets, each of your buckets should be able to entirely load into > > memory on the node. > > > > I use something close the formula below to calculate the number of > buckets: > > > > #buckets = (x * Average_partition_size) / > > JVM_memory_available_to_your_Hadoop_tasknode > > > > I call x (>1) the "factor of conservatism". Higher x means you are Nitin Pawar
-
FW: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-24, 11:28
Here are both tables:
$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_stats) but it fails non existing class (my JSONSerde) . From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 11:46 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? This operation is erroring out on the hive client itself before starting a map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote: Hmm I must be doing something wrong, the members_map table is 300ish MB. When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999 Memory usage: 43501848 rate: 0.023 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: 6899999 Memory usage: 1449867552 rate: 0.778 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1468378760 rate: 0.788 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space I'm running it only my local, single node, dev env, could that be a problem since it won't split over multiple mappers in this case? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 9:47 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben Map join hint is provided to hive using "MAPJOIN" keyword as : SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key To use map side join some hive configuration properties needs to be enabled For plain map side joins hive>SET hive.auto.convert.join=true; Latest versions of hive does a map join on the smaller table even if no map join hit is provided. For bucketed map joins hive>SET hive.optimize.bucketmapjoin=true https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Regards Bejoy ________________________________ From: Nitin Pawar <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, April 24, 2012 12:46 PM Subject: Re: When/how to use partitions and buckets usefully? If you are doing a map side join make sure the table members_map is small enough to hold in memory On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: > Wow thanks everyone for the nice feedback! > > I can force a mapside join by doing /*+ STREAMTABLE(members_map) */ right? > > > Cheers, > > Ruben de Vries > > -----Original Message----- > From: Mark Grover [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 3:17 AM > To: [EMAIL PROTECTED]; bejoy ks > Cc: Ruben de Vries > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben, > Like Bejoy pointed out, members_map is small enough to fit in memory, > so your joins with visit_stats would be much faster with map-side join. > > However, there is still some virtue in bucketing visit_stats. > Bucketing can optimize joins, group by's and potentially other queries Nitin Pawar Nitin Pawar
-
Re: When/how to use partitions and buckets usefully?Bejoy Ks 2012-04-24, 13:57
Hi Ruben
The operation you are seeing in your log is preparation of hash table of the smaller table, This hash table file is compressed and loaded into Distributed Cache and from there it is used for map side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB, the data is large to be loaded into memory of the hive client. 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1,468,378,760 rate: 0.788 Can you enable debug logging and post in the console to get a better picture why it consumes this much memory. Start your hive shell as hive -hiveconf hive.root.logger=ALL,console; Regards Bejoy KS ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tuesday, April 24, 2012 4:58 PM Subject: FW: When/how to use partitions and buckets usefully? Here are both tables: $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_stats) but it fails non existing class (my JSONSerde) . From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 11:46 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? This operation is erroring out on the hive client itself before starting a map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote: Hmm I must be doing something wrong, the members_map table is 300ish MB. When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999 Memory usage: 43501848 rate: 0.023 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: 6899999 Memory usage: 1449867552 rate: 0.778 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1468378760 rate: 0.788 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space I'm running it only my local, single node, dev env, could that be a problem since it won't split over multiple mappers in this case? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 9:47 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben Map join hint is provided to hive using "MAPJOIN" keyword as : SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key To use map side join some hive configuration properties needs to be enabled For plain map side joins hive>SET hive.auto.convert.join=true; Latest versions of hive does a map join on the smaller table even if no map join hit is provided. For bucketed map joins hive>SET hive.optimize.bucketmapjoin=true https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Regards Bejoy ________________________________ From: Nitin Pawar <[EMAIL PROTECTED]> ToSent: Tuesday, April 24, 2012 12:46 PM Subject: Re: When/how to use partitions and buckets usefully? If you are doing a map side join make sure the table members_map is small enough to hold in memory On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: Nitin Pawar Nitin Pawar
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-24, 16:09
I got the (rather big) log here in a github gist: https://gist.github.com/2480893
And I also attached the plan.xml it was using to the gist. When loading the members_map (11mil records, 320mb, 30b per record), it seems to take about 198b per record in the members_map, resulting in crashing around 7mil records with 1.4gb loaded. The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where - birthday is a string containing YYYY-MM-DD - gender is a tinyint, 1 2 or 3 - member_id is int with the highest member_id being 14343249 (14mil) The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}" I also tried doing the same thing but with an empty visit_stats table, with the same effect Some of the blogs I read talk about 25mb small table, not 300mb like mine ... Anyone can make anything out of this? I'd rather go with this if at all possible, otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 3:58 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben The operation you are seeing in your log is preparation of hash table of the smaller table, This hash table file is compressed and loaded into Distributed Cache and from there it is used for map side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB, the data is large to be loaded into memory of the hive client. 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1,468,378,760 rate: 0.788 Can you enable debug logging and post in the console to get a better picture why it consumes this much memory. Start your hive shell as hive -hiveconf hive.root.logger=ALL,console; Regards Bejoy KS ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tuesday, April 24, 2012 4:58 PM Subject: FW: When/how to use partitions and buckets usefully? Here are both tables: $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_stats) but it fails non existing class (my JSONSerde) . From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 11:46 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? This operation is erroring out on the hive client itself before starting a map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote: Hmm I must be doing something wrong, the members_map table is 300ish MB. When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999 Memory usage: 43501848 rate: 0.023 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: 6899999 Memory usage: 1449867552 rate: 0.778 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1468378760 rate: 0.788 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space I'm running it only my local, single node, dev env, could that be a problem since it won't split over multiple mappers in this case? From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 9:47 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben Map join hint is provided to hive using "MAPJOIN" keyword as : SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key To use map side join some hive configuration properties needs to be enabled For plain map side joins hive>SET hive.auto.convert.join=true; Latest versions of hive does a map join on the smaller table even if no map join hit is provided. For bucketed map joins hive>SET hive.optimize.bucketmapjoin=true https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Regards Bejoy ________________________________ From: Nitin Pawar <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, April 24, 2012 12:46 PM Subject: Re: When/how to use partitions and buckets usefully? If you are doing a map side join make sure the table members_map is small enough to hold in memory On 4/24/12, Ruben de Vries <[EMAIL PROTECTED]> wrote: Nitin Pawar Nitin Pawar
-
Re: When/how to use partitions and buckets usefully?gemini alex 2012-04-25, 07:36
it's seemed you use the default hive configuration, the default map join
will have only 25M for small table, copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 在 2012年4月25日 上午12:09��Ruben de Vries <[EMAIL PROTECTED]>写道: > I got the (rather big) log here in a github gist: > https://gist.github.com/2480893 > And I also attached the plan.xml it was using to the gist. > > When loading the members_map (11mil records, 320mb, 30b per record), it > seems to take about 198b per record in the members_map, resulting in > crashing around 7mil records with 1.4gb loaded. > > The members_map is a TEXTFILE with (member_id INT, gender INT, birthday > STRING) where > - birthday is a string containing YYYY-MM-DD > - gender is a tinyint, 1 2 or 3 > - member_id is int with the highest member_id being 14343249 (14mil) > > The log says: > "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, > string birthdate}" > > I also tried doing the same thing but with an empty visit_stats table, > with the same effect > Some of the blogs I read talk about 25mb small table, not 300mb like mine > ... > > Anyone can make anything out of this? > I'd rather go with this if at all possible, > otherwise I have to go the hard way and migrate all the visit_stats into > buckets so they can match the members_map on that? > > -----Original Message----- > From: Bejoy Ks [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 3:58 PM > To: [EMAIL PROTECTED] > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben > The operation you are seeing in your log is preparation of hash > table of the smaller table, This hash table file is compressed and loaded > into Distributed Cache and from there it is used for map side joins. From > your console log the hash table size/data size has gone to nearly 1.5 GB, > the data is large to be loaded into memory of the hive client. > > 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: > 6999999 Memory usage: 1,468,378,760 rate: 0.788 > > > Can you enable debug logging and post in the console to get a better > picture why it consumes this much memory. > Start your hive shell as > hive -hiveconf hive.root.logger=ALL,console; > > > Regards > Bejoy KS > > > > ________________________________ > From: Ruben de Vries <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Sent: Tuesday, April 24, 2012 4:58 PM > Subject: FW: When/how to use partitions and buckets usefully? > > Here are both tables: > > $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map > 1 1 247231757 > hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map > > $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats > 442 441 1091837835 > hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats > > The 'work' I'm seeing on console is the loading of the table into memory? > > It seems like it's loading the visit_stats table instead ?! > I tried doing MAPJOIN(visit_stats) but it fails non existing class (my > JSONSerde) . > > > From: Nitin Pawar [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 11:46 AM > To: [EMAIL PROTECTED] > Subject: Re: When/how to use partitions and buckets usefully? > > This operation is erroring out on the hive client itself before starting a > map so splitting to mappers is out of question. > > can you do a dfs count for the members_map table hdfslocation and tell us > the result? > > On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> > wrote: > Hmm I must be doing something wrong, the members_map table is 300ish MB. > When I execute the following query: > > SELECT > /*+ MAPJOIN(members_map) */ > date_int, > members_map.gender AS gender, > 'generic', > COUNT( memberId ) AS unique, > SUM( `generic`['count'] ) AS count, > SUM( `generic`['seconds'] ) AS seconds
-
Re: When/how to use partitions and buckets usefully?gemini alex 2012-04-25, 07:40
there should be documented in wiki on LanguageManual+Joins .
在 2012年4月25日 下午3:36��gemini alex <[EMAIL PROTECTED]>写道: > it's seemed you use the default hive configuration, the default map join > will have only 25M for small table, copy your hive-default.xml to > hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 > > 在 2012年4月25日 上午12:09���Ruben de Vries <[EMAIL PROTECTED]>写道: > > I got the (rather big) log here in a github gist: >> https://gist.github.com/2480893 >> And I also attached the plan.xml it was using to the gist. >> >> When loading the members_map (11mil records, 320mb, 30b per record), it >> seems to take about 198b per record in the members_map, resulting in >> crashing around 7mil records with 1.4gb loaded. >> >> The members_map is a TEXTFILE with (member_id INT, gender INT, birthday >> STRING) where >> - birthday is a string containing YYYY-MM-DD >> - gender is a tinyint, 1 2 or 3 >> - member_id is int with the highest member_id being 14343249 (14mil) >> >> The log says: >> "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, >> string birthdate}" >> >> I also tried doing the same thing but with an empty visit_stats table, >> with the same effect >> Some of the blogs I read talk about 25mb small table, not 300mb like mine >> ... >> >> Anyone can make anything out of this? >> I'd rather go with this if at all possible, >> otherwise I have to go the hard way and migrate all the visit_stats into >> buckets so they can match the members_map on that? >> >> -----Original Message----- >> From: Bejoy Ks [mailto:[EMAIL PROTECTED]] >> Sent: Tuesday, April 24, 2012 3:58 PM >> To: [EMAIL PROTECTED] >> Subject: Re: When/how to use partitions and buckets usefully? >> >> Hi Ruben >> The operation you are seeing in your log is preparation of hash >> table of the smaller table, This hash table file is compressed and loaded >> into Distributed Cache and from there it is used for map side joins. From >> your console log the hash table size/data size has gone to nearly 1.5 GB, >> the data is large to be loaded into memory of the hive client. >> >> 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: >> 6999999 Memory usage: 1,468,378,760 rate: 0.788 >> >> >> Can you enable debug logging and post in the console to get a better >> picture why it consumes this much memory. >> Start your hive shell as >> hive -hiveconf hive.root.logger=ALL,console; >> >> >> Regards >> Bejoy KS >> >> >> >> ________________________________ >> From: Ruben de Vries <[EMAIL PROTECTED]> >> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> >> Sent: Tuesday, April 24, 2012 4:58 PM >> Subject: FW: When/how to use partitions and buckets usefully? >> >> Here are both tables: >> >> $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map >> 1 1 247231757 >> hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map >> >> $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats >> 442 441 1091837835 >> hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats >> >> The 'work' I'm seeing on console is the loading of the table into memory? >> >> It seems like it's loading the visit_stats table instead ?! >> I tried doing MAPJOIN(visit_stats) but it fails non existing class (my >> JSONSerde) . >> >> >> From: Nitin Pawar [mailto:[EMAIL PROTECTED]] >> Sent: Tuesday, April 24, 2012 11:46 AM >> To: [EMAIL PROTECTED] >> Subject: Re: When/how to use partitions and buckets usefully? >> >> This operation is erroring out on the hive client itself before starting >> a map so splitting to mappers is out of question. >> >> can you do a dfs count for the members_map table hdfslocation and tell us >> the result? >> >> On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> >> wrote: >> Hmm I must be doing something wrong, the members_map table is 300ish MB. >> When I execute the following query:
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-25, 07:48
I already tried running with that set to 400mb, but it didn’t work and that setting is only used when it’s trying to automatically figure out if it should be doing a mapjoin, while I’m forcing it to do a mapjoin with a hint
From: gemini alex [mailto:[EMAIL PROTECTED]] Sent: Wednesday, April 25, 2012 9:40 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? there should be documented in wiki on LanguageManual+Joins . 在 2012年4月25日 下午3:36,gemini alex <[EMAIL PROTECTED]>写道: it's seemed you use the default hive configuration, the default map join will have only 25M for small table, copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 在 2012年4月25日 上午12:09,Ruben de Vries <[EMAIL PROTECTED]>写道: I got the (rather big) log here in a github gist: https://gist.github.com/2480893 And I also attached the plan.xml it was using to the gist. When loading the members_map (11mil records, 320mb, 30b per record), it seems to take about 198b per record in the members_map, resulting in crashing around 7mil records with 1.4gb loaded. The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where - birthday is a string containing YYYY-MM-DD - gender is a tinyint, 1 2 or 3 - member_id is int with the highest member_id being 14343249 (14mil) The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}" I also tried doing the same thing but with an empty visit_stats table, with the same effect Some of the blogs I read talk about 25mb small table, not 300mb like mine ... Anyone can make anything out of this? I'd rather go with this if at all possible, otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 3:58 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben The operation you are seeing in your log is preparation of hash table of the smaller table, This hash table file is compressed and loaded into Distributed Cache and from there it is used for map side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB, the data is large to be loaded into memory of the hive client. 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1,468,378,760 rate: 0.788 Can you enable debug logging and post in the console to get a better picture why it consumes this much memory. Start your hive shell as hive -hiveconf hive.root.logger=ALL,console; Regards Bejoy KS ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tuesday, April 24, 2012 4:58 PM Subject: FW: When/how to use partitions and buckets usefully? Here are both tables: $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_stats) but it fails non existing class (my JSONSerde) . From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 11:46 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? This operation is erroring out on the hive client itself before starting a map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote: Hmm I must be doing something wrong, the members_map table is 300ish MB. When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999 Memory usage: 43501848 rate: 0.023 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: 6899999 Memory usage: 1449867552 rate: 0.778 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1468378760 rate: 0.788 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space I'm running it only my local, single node, dev env, could that be a problem since it won't split over multiple mappers in this case? From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 9:47 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben Map join hint is provided to hive using "MAPJOIN" keyword as : SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key To use map side join some hive configuration properties needs to be enabled For plain map side joins hive>SET hive.auto.convert.join=true; Latest versions of hive does a map join on the smaller table even if no map join hit is provided. For bucketed map joins hive>SET hive.optimize.bucketmapjoin=true https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Regards Bejoy ________________________________ From: Nitin Pawar <nitinpawar432@gmai
-
Re: When/how to use partitions and buckets usefully?Mark Grover 2012-04-26, 00:59
I am hoping that other people who have used Map Join can pitch in here...
When the smaller table gets loaded into mapper's memory, the data is loaded in its uncompressed form, right? If so, is it possible at all in this case that the compressed size of smaller table is less than the memory available but the uncompressed size isn't? Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Ruben de Vries" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Wednesday, April 25, 2012 3:48:46 AM Subject: RE: When/how to use partitions and buckets usefully? I already tried running with that set to 400mb, but it didn’t work and that setting is only used when it’s trying to automatically figure out if it should be doing a mapjoin, while I’m forcing it to do a mapjoin with a hint From: gemini alex [mailto:[EMAIL PROTECTED]] Sent: Wednesday, April 25, 2012 9:40 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? there should be documented in wiki on LanguageManual+Joins . 在 2012年4月25日 下午3:36��gemini alex <[EMAIL PROTECTED]>写道: it's seemed you use the default hive configuration, the default map join will have only 25M for small table, copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 在 2012年4月25日 上午12:09��Ruben de Vries <[EMAIL PROTECTED]>写道: I got the (rather big) log here in a github gist: https://gist.github.com/2480893 And I also attached the plan.xml it was using to the gist. When loading the members_map (11mil records, 320mb, 30b per record), it seems to take about 198b per record in the members_map, resulting in crashing around 7mil records with 1.4gb loaded. The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where - birthday is a string containing YYYY-MM-DD - gender is a tinyint, 1 2 or 3 - member_id is int with the highest member_id being 14343249 (14mil) The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}" I also tried doing the same thing but with an empty visit_stats table, with the same effect Some of the blogs I read talk about 25mb small table, not 300mb like mine ... Anyone can make anything out of this? I'd rather go with this if at all possible, otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 3:58 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben The operation you are seeing in your log is preparation of hash table of the smaller table, This hash table file is compressed and loaded into Distributed Cache and from there it is used for map side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB, the data is large to be loaded into memory of the hive client. 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1,468,378,760 rate: 0.788 Can you enable debug logging and post in the console to get a better picture why it consumes this much memory. Start your hive shell as hive -hiveconf hive.root.logger=ALL,console; Regards Bejoy KS ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tuesday, April 24, 2012 4:58 PM Subject: FW: When/how to use partitions and buckets usefully? Here are both tables: $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_stats) but it fails non existing class (my JSONSerde) . From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 11:46 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? This operation is erroring out on the hive client itself before starting a map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote: Hmm I must be doing something wrong, the members_map table is 300ish MB. When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999 Memory usage: 43501848 rate: 0.023 2012-04-24 10:30:54 Processing rows: 6900000 Hashtable size: 6899999 Memory usage: 1449867552 rate: 0.778 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1468378760 rate: 0.788 Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space I'm running it only my local, single node, dev env, could that be a problem since it won't split over multiple m
-
Re: When/how to use partitions and buckets usefully?gemini alex 2012-04-26, 03:37
do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml?
在 2012年4月26日 上午8:59��Mark Grover <[EMAIL PROTECTED]>写道: > I am hoping that other people who have used Map Join can pitch in here... > > When the smaller table gets loaded into mapper's memory, the data is > loaded in its uncompressed form, right? > > If so, is it possible at all in this case that the compressed size of > smaller table is less than the memory available but the uncompressed size > isn't? > > Mark > > Mark Grover, Business Intelligence Analyst > OANDA Corporation > > www: oanda.com www: fxtrade.com > e: [EMAIL PROTECTED] > > "Best Trading Platform" - World Finance's Forex Awards 2009. > "The One to Watch" - Treasury Today's Adam Smith Awards 2009. > > > ----- Original Message ----- > From: "Ruben de Vries" <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Wednesday, April 25, 2012 3:48:46 AM > Subject: RE: When/how to use partitions and buckets usefully? > > I already tried running with that set to 400mb, but it didn’t work and > that setting is only used when it’s trying to automatically figure out if > it should be doing a mapjoin, while I’m forcing it to do a mapjoin with a > hint > > From: gemini alex [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, April 25, 2012 9:40 AM > To: [EMAIL PROTECTED] > Subject: Re: When/how to use partitions and buckets usefully? > > there should be documented in wiki on LanguageManual+Joins . > 在 2012年4月25日 下午3:36��gemini alex <[EMAIL PROTECTED]>写道: > it's seemed you use the default hive configuration, the default map join > will have only 25M for small table, copy your hive-default.xml to > hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 > 在 2012年4月25日 上午12:09���Ruben de Vries <[EMAIL PROTECTED]>写道: > > I got the (rather big) log here in a github gist: > https://gist.github.com/2480893 > And I also attached the plan.xml it was using to the gist. > > When loading the members_map (11mil records, 320mb, 30b per record), it > seems to take about 198b per record in the members_map, resulting in > crashing around 7mil records with 1.4gb loaded. > > The members_map is a TEXTFILE with (member_id INT, gender INT, birthday > STRING) where > - birthday is a string containing YYYY-MM-DD > - gender is a tinyint, 1 2 or 3 > - member_id is int with the highest member_id being 14343249 (14mil) > > The log says: > "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, > string birthdate}" > > I also tried doing the same thing but with an empty visit_stats table, > with the same effect > Some of the blogs I read talk about 25mb small table, not 300mb like mine > ... > > Anyone can make anything out of this? > I'd rather go with this if at all possible, > otherwise I have to go the hard way and migrate all the visit_stats into > buckets so they can match the members_map on that? > > -----Original Message----- > From: Bejoy Ks [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, April 24, 2012 3:58 PM > To: [EMAIL PROTECTED] > Subject: Re: When/how to use partitions and buckets usefully? > > Hi Ruben > The operation you are seeing in your log is preparation of hash table > of the smaller table, This hash table file is compressed and loaded into > Distributed Cache and from there it is used for map side joins. From your > console log the hash table size/data size has gone to nearly 1.5 GB, the > data is large to be loaded into memory of the hive client. > > 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: > 6999999 Memory usage: 1,468,378,760 rate: 0.788 > > > Can you enable debug logging and post in the console to get a better > picture why it consumes this much memory. > Start your hive shell as > hive -hiveconf hive.root.logger=ALL,console; > > > Regards > Bejoy KS > > > > ________________________________ > From: Ruben de Vries <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Sent: Tuesday, April 24, 2012 4:58 PM
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-26, 07:16
The table isn't compressed, the uncompressed size is really 300ish mb.
Also I haven't set the mapred.child.java.opts, but I think that if it crashes at 1.5gb that the default value seems high enough? From: gemini alex [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 26, 2012 5:37 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml? 在 2012年4月26日 上午8:59,Mark Grover <[EMAIL PROTECTED]>写道: I am hoping that other people who have used Map Join can pitch in here... When the smaller table gets loaded into mapper's memory, the data is loaded in its uncompressed form, right? If so, is it possible at all in this case that the compressed size of smaller table is less than the memory available but the uncompressed size isn't? Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Ruben de Vries" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Wednesday, April 25, 2012 3:48:46 AM Subject: RE: When/how to use partitions and buckets usefully? I already tried running with that set to 400mb, but it didn’t work and that setting is only used when it’s trying to automatically figure out if it should be doing a mapjoin, while I’m forcing it to do a mapjoin with a hint From: gemini alex [mailto:[EMAIL PROTECTED]] Sent: Wednesday, April 25, 2012 9:40 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? there should be documented in wiki on LanguageManual+Joins . 在 2012年4月25日 下午3:36,gemini alex <[EMAIL PROTECTED]>写道: it's seemed you use the default hive configuration, the default map join will have only 25M for small table, copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 在 2012年4月25日 上午12:09,Ruben de Vries <[EMAIL PROTECTED]>写道: I got the (rather big) log here in a github gist: https://gist.github.com/2480893 And I also attached the plan.xml it was using to the gist. When loading the members_map (11mil records, 320mb, 30b per record), it seems to take about 198b per record in the members_map, resulting in crashing around 7mil records with 1.4gb loaded. The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where - birthday is a string containing YYYY-MM-DD - gender is a tinyint, 1 2 or 3 - member_id is int with the highest member_id being 14343249 (14mil) The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}" I also tried doing the same thing but with an empty visit_stats table, with the same effect Some of the blogs I read talk about 25mb small table, not 300mb like mine ... Anyone can make anything out of this? I'd rather go with this if at all possible, otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that? -----Original Message----- From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 3:58 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben The operation you are seeing in your log is preparation of hash table of the smaller table, This hash table file is compressed and loaded into Distributed Cache and from there it is used for map side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB, the data is large to be loaded into memory of the hive client. 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1,468,378,760 rate: 0.788 Can you enable debug logging and post in the console to get a better picture why it consumes this much memory. Start your hive shell as hive -hiveconf hive.root.logger=ALL,console; Regards Bejoy KS ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tuesday, April 24, 2012 4:58 PM Subject: FW: When/how to use partitions and buckets usefully? Here are both tables: $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_stats) but it fails non existing class (my JSONSerde) . From: Nitin Pawar [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 11:46 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? This operation is erroring out on the hive client itself before starting a map so splitting to mappers is out of question. can you do a dfs count for the members_map table hdfslocation and tell us the result? On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote: Hmm I must be doing something wrong, the members_map table is 300ish MB. When I execute the following query: SELECT /*+ MAPJOIN(members_map) */ date_int, members_map.gender AS gender, 'generic', COUNT( memberId ) AS unique, SUM( `generic`['count'] ) AS count, SUM( `generic`['seconds'] ) AS seconds FROM visit_stats JOIN members_map ON(members_map.member_id = visit_stats.memberId) GROUP BY date_int, members_map.gender It results in: 2012-04-24 10:25:59 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-24 10:26:00 Processing rows: 200000 Hashtable size: 199999
-
RE: When/how to use partitions and buckets usefully?Ruben de Vries 2012-04-26, 09:06
We're atm sampling our access logs 1:250 by doing a modulo of 250 on the memberID.
I applied the same logic to the members_map, reducing its raw size to just 0.9mb. Now when I run the query with the MAPJOIN hint it goes: 2012-04-26 10:51:32 Starting to launch local task to process map join; maximum memory = 1864171520 2012-04-26 10:51:33 Processing rows: 46490 Hashtable size: 46490 Memory usage: 11142520 rate: 0.006 - done - So it uses 10mb in memory to load the 1mb raw size members_map, this matches the ratio which I previously had when loading the full table and crashing (250mb required 2.6gb). I'm not sure why it requires so much more memory then the raw filesize, but it's defenatly the main problem here. Now it's just the question of is this a bug? Or is explainable and acceptable? FYI my most complex query dropped from 350sec to 110sec when being able to MAPJOIN(), gotta love that speed if it works! -----Original Message----- From: Ruben de Vries [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 26, 2012 9:16 AM To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: RE: When/how to use partitions and buckets usefully? The table isn't compressed, the uncompressed size is really 300ish mb. Also I haven't set the mapred.child.java.opts, but I think that if it crashes at 1.5gb that the default value seems high enough? From: gemini alex [mailto:[EMAIL PROTECTED]] Sent: Thursday, April 26, 2012 5:37 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? do you set your mapred.child.java.opts in hadoop/conf/mapred-site.xml? 在 2012年4月26日 上午8:59,Mark Grover <[EMAIL PROTECTED]>写道: I am hoping that other people who have used Map Join can pitch in here... When the smaller table gets loaded into mapper's memory, the data is loaded in its uncompressed form, right? If so, is it possible at all in this case that the compressed size of smaller table is less than the memory available but the uncompressed size isn't? Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: [EMAIL PROTECTED] "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Ruben de Vries" <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Wednesday, April 25, 2012 3:48:46 AM Subject: RE: When/how to use partitions and buckets usefully? I already tried running with that set to 400mb, but it didn’t work and that setting is only used when it’s trying to automatically figure out if it should be doing a mapjoin, while I’m forcing it to do a mapjoin with a hint From: gemini alex [mailto:[EMAIL PROTECTED]] Sent: Wednesday, April 25, 2012 9:40 AM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? there should be documented in wiki on LanguageManual+Joins . 在 2012年4月25日 下午3:36,gemini alex <[EMAIL PROTECTED]>写道: it's seemed you use the default hive configuration, the default map join will have only 25M for small table, copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000 在 2012年4月25日 上午12:09,Ruben de Vries <[EMAIL PROTECTED]>写道: I got the (rather big) log here in a github gist: https://gist.github.com/2480893 And I also attached the plan.xml it was using to the gist. When loading the members_map (11mil records, 320mb, 30b per record), it seems to take about 198b per record in the members_map, resulting in crashing around 7mil records with 1.4gb loaded. The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where - birthday is a string containing YYYY-MM-DD - gender is a tinyint, 1 2 or 3 - member_id is int with the highest member_id being 14343249 (14mil) The log says: "INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}" I also tried doing the same thing but with an empty visit_stats table, with the same effect Some of the blogs I read talk about 25mb small table, not 300mb like mine ... Anyone can make anything out of this? I'd rather go with this if at all possible, otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can match the members_map on that? From: Bejoy Ks [mailto:[EMAIL PROTECTED]] Sent: Tuesday, April 24, 2012 3:58 PM To: [EMAIL PROTECTED] Subject: Re: When/how to use partitions and buckets usefully? Hi Ruben The operation you are seeing in your log is preparation of hash table of the smaller table, This hash table file is compressed and loaded into Distributed Cache and from there it is used for map side joins. From your console log the hash table size/data size has gone to nearly 1.5 GB, the data is large to be loaded into memory of the hive client. 2012-04-24 10:31:02 Processing rows: 7000000 Hashtable size: 6999999 Memory usage: 1,468,378,760 rate: 0.788 Can you enable debug logging and post in the console to get a better picture why it consumes this much memory. Start your hive shell as hive -hiveconf hive.root.logger=ALL,console; Regards Bejoy KS ________________________________ From: Ruben de Vries <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Tuesday, April 24, 2012 4:58 PM Subject: FW: When/how to use partitions and buckets usefully? Here are both tables: $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map 1 1 247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map $ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats 442 441 1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats The 'work' I'm seeing on console is the loading of the table into memory? It seems like it's loading the visit_stats table instead ?! I tried doing MAPJOIN(visit_ |