Hive >> mail # user >> When/how to use partitions and buckets usefully?

Re: When/how to use partitions and buckets usefully?
I am hoping that other people who have used Map Join can pitch in here...

When the smaller table gets loaded into mapper's memory, the data is loaded in its uncompressed form, right?

If so, is it possible at all in this case that the compressed size of smaller table is less than the memory available but the uncompressed size isn't?
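To make the concern concrete, here is a toy sketch (plain Python, not Hive's actual loading code; the 5 MB budget is made up for illustration) showing how data can fit a memory budget in compressed form but not once decompressed:

```python
import gzip

# Highly repetitive rows compress very well, so the on-disk
# (compressed) size can sit far below a memory budget while the
# decompressed size blows past it.
raw = b"1\t2\t1990-01-01\n" * 1_000_000   # ~15 MB uncompressed
compressed = gzip.compress(raw)

budget = 5 * 1024 * 1024                   # hypothetical 5 MB limit

print(len(compressed) <= budget)           # the compressed form fits
print(len(raw) <= budget)                  # the uncompressed form does not
```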


Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
----- Original Message -----
From: "Ruben de Vries" <[EMAIL PROTECTED]>
Sent: Wednesday, April 25, 2012 3:48:46 AM
Subject: RE: When/how to use partitions and buckets usefully?

I already tried running with that set to 400 MB, but it didn't work. Besides, that setting is only used when Hive is automatically deciding whether it should do a map join, while I'm forcing the map join with a hint.

From: gemini alex [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 25, 2012 9:40 AM
Subject: Re: When/how to use partitions and buckets usefully?

This should be documented in the wiki on the LanguageManual+Joins page.
On April 25, 2012 at 3:36 PM, gemini alex <[EMAIL PROTECTED]> wrote:
It seems you are using the default Hive configuration; by default a map join allows only 25 MB for the small table. Copy your hive-default.xml to hive-site.xml and set hive.mapjoin.smalltable.filesize=300000000.
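For reference, that property would go into hive-site.xml roughly like this (value taken from the advice above):

```xml
<property>
  <name>hive.mapjoin.smalltable.filesize</name>
  <value>300000000</value>
  <description>Max size (in bytes) of the small table for automatic map join conversion</description>
</property>
```

It can also be set per session from the Hive shell with SET hive.mapjoin.smalltable.filesize=300000000;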
On April 25, 2012 at 12:09 AM, Ruben de Vries <[EMAIL PROTECTED]> wrote:

I got the (rather big) log here in a github gist: https://gist.github.com/2480893
And I also attached the plan.xml it was using to the gist.

When loading the members_map table (11 million records, 320 MB, ~30 bytes per record), Hive seems to use about 198 bytes per record in memory for the members_map, and it crashes at around 7 million records with 1.4 GB loaded.
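A back-of-the-envelope check of those numbers (plain arithmetic, using only the figures quoted in this thread):

```python
records = 11_000_000    # rows in members_map
bytes_per_row = 198     # in-memory size observed from the hashtable log
heap = 1_864_171_520    # "maximum memory" reported by the Hive client

needed = records * bytes_per_row
print(f"needed: {needed:,} bytes")        # ~2.18 GB for the full hashtable
print(f"fits in heap: {needed <= heap}")  # False: it cannot fit

# The crash point is consistent: dividing the heap by 198 bytes per row
# gives the best-case row count before the OutOfMemoryError (~9.4 million).
print(f"rows before OOM: {heap // bytes_per_row:,}")
```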

The members_map is a TEXTFILE with (member_id INT, gender INT, birthday STRING) where
 - birthday is a string containing YYYY-MM-DD
 - gender is a tinyint, 1 2 or 3
 - member_id is int with the highest member_id being 14343249 (14mil)

The log says:
"INFO hive.log: DDL: struct members_map { i32 member_id, i32 gender, string birthdate}"

I also tried the same thing with an empty visit_stats table, with the same effect.
Some of the blogs I read talk about a 25 MB small table, not 300 MB like mine ...

Can anyone make anything out of this?
I'd rather go this route if at all possible;
otherwise I have to go the hard way and migrate all the visit_stats into buckets so they can be matched against the members_map that way?
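For completeness, the bucketed route mentioned above would look roughly like this. This is a sketch only: the bucket count is arbitrary, and the visit_stats columns are guessed from the query later in this thread:

```sql
-- Bucket (and sort) both tables on the join key, with the same bucket count
CREATE TABLE members_map_b (member_id INT, gender INT, birthday STRING)
CLUSTERED BY (member_id) SORTED BY (member_id) INTO 32 BUCKETS;

CREATE TABLE visit_stats_b (memberId INT, date_int INT, generic MAP<STRING,BIGINT>)
CLUSTERED BY (memberId) SORTED BY (memberId) INTO 32 BUCKETS;

-- Then enable the bucket map join optimizations before running the join
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```

With matching bucket counts on the join key, each mapper only needs to load one bucket of the small table instead of the whole hash table.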

-----Original Message-----
From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 24, 2012 3:58 PM
Subject: Re: When/how to use partitions and buckets usefully?

Hi Ruben
      The operation you are seeing in your log is the preparation of a hash table from the smaller table. This hash table file is compressed, loaded into the Distributed Cache, and used from there for map-side joins. From your console log, the hash table size has grown to nearly 1.5 GB; that data is too large to be loaded into the memory of the Hive client.

2012-04-24 10:31:02     Processing rows:        7000000 Hashtable size: 6999999 Memory usage:   1,468,378,760      rate:   0.788
Can you enable debug logging and post the console output, to get a better picture of why it consumes this much memory?
Start your hive shell as:
hive -hiveconf hive.root.logger=ALL,console
Bejoy KS

From: Ruben de Vries <[EMAIL PROTECTED]>
Sent: Tuesday, April 24, 2012 4:58 PM
Subject: FW: When/how to use partitions and buckets usefully?

Here are both tables:

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/members_map
           1            1          247231757 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/members_map

$ hdfs -count /user/hive/warehouse/hyves_goldmine.db/visit_stats
         442          441         1091837835 hdfs://localhost:54310/user/hive/warehouse/hyves_goldmine.db/visit_stats

The 'work' I'm seeing on the console is the loading of the table into memory?

It seems like it's loading the visit_stats table instead?!
I tried doing MAPJOIN(visit_stats), but it fails with a non-existent class error (my JSONSerde).

From: Nitin Pawar [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 24, 2012 11:46 AM
Subject: Re: When/how to use partitions and buckets usefully?

This operation is erroring out on the Hive client itself, before any map task starts, so splitting across mappers is out of the question.

Can you do a dfs count on the members_map table's HDFS location and tell us the result?

On Tue, Apr 24, 2012 at 2:06 PM, Ruben de Vries <[EMAIL PROTECTED]> wrote:
Hmm, I must be doing something wrong; the members_map table is only around 300 MB.
When I execute the following query:

SELECT /*+ MAPJOIN(members_map) */
   members_map.gender AS gender,
   COUNT( memberId ) AS unique,
   SUM( `generic`['count'] ) AS count,
   SUM( `generic`['seconds'] ) AS seconds
FROM visit_stats
JOIN members_map ON(members_map.member_id = visit_stats.memberId)
GROUP BY date_int, members_map.gender

It results in:
2012-04-24 10:25:59     Starting to launch local task to process map join;      maximum memory = 1864171520
2012-04-24 10:26:00     Processing rows:        200000          Hashtable size: 199999          Memory usage:   43501848        rate:   0.023
2012-04-24 10:30:54     Processing rows:        6900000 Hashtable size: 6899999 Memory usage:   1449867552      rate:   0.778
2012-04-24 10:31:02     Processing rows:        7000000 Hashtable size: 6999999 Memory usage:   1468378760      rate:   0.788
Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
I'm running it only on my local, single-node dev env; could that be a problem since it won't split over multiple mappers?