Hi all,
I found two parameters that relate to map join:
hive.mapjoin.cache.numrows
hive.mapjoin.size.key
I can't find any formal documentation on these two parameters.
I guess "hive.mapjoin.cache.numrows" sets the maximum row count of the small table in a map join, and that rows beyond that limit are discarded. When I used map join with a table of more than 50,000 rows, some records could not be joined, and I fixed the problem by increasing "hive.mapjoin.cache.numrows".
However, I sometimes still get an OOM exception even when "hive.mapjoin.cache.numrows" is not set (the default is 25000, I guess).
Please explain how these parameters are used if you know. Thanks.
-- Best Regards, Ted Xu
-
Re: Mapjoin parameters?
John Sichi 2010-08-19, 20:56
For hive.mapjoin.cache.numrows, I found this in hive/conf/hive-default.xml:
<property>
  <name>hive.mapjoin.cache.numrows</name>
  <value>25000</value>
  <description>How many rows should be cached by jdbm for map join.</description>
</property>
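To check whether a given parameter is documented in a file shaped like hive-default.xml, a small lookup can help. The following is a minimal sketch using only the JDK's XML parser, not Hive's own configuration loader; the class name MapJoinConfSketch, the method getInt, and the inline sample XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: NOT part of Hive. It only mimics a lookup with a fallback.
final class MapJoinConfSketch {

    // Returns the integer <value> of the <property> whose <name> matches,
    // or the fallback when the property is not present in the XML.
    static int getInt(String xml, String name, int fallback) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String n = p.getElementsByTagName("name").item(0).getTextContent().trim();
                if (n.equals(name)) {
                    String v = p.getElementsByTagName("value").item(0).getTextContent().trim();
                    return Integer.parseInt(v);
                }
            }
            return fallback;
        } catch (Exception e) {
            throw new IllegalStateException("could not read configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample input assembled from the property quoted in this message.
        String xml = "<configuration><property>"
                + "<name>hive.mapjoin.cache.numrows</name><value>25000</value>"
                + "</property></configuration>";
        System.out.println(getInt(xml, "hive.mapjoin.cache.numrows", -1)); // 25000
        System.out.println(getInt(xml, "hive.mapjoin.size.key", -1));      // -1 (not listed)
    }
}
```

A fallback of -1 makes an undocumented parameter easy to spot; what default Hive itself applies in that case is not something this sketch can tell you.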
hive.mapjoin.size is missing from hive-default.xml; can you create a JIRA issue for that?
JVS
On Aug 19, 2010, at 1:07 AM, Ted Xu wrote:
-
Re: Mapjoin parameters?
Ted Xu 2010-08-20, 01:44
Thanks John, I'll create an issue for that.
PS: So in a map join only the first 25000 rows of the small table are cached by default, am I right? If the small table has more than 25000 rows, will we lose some portion of the data without any warning or exception?
-
Re: Mapjoin parameters?
Ted Yu 2010-08-20, 17:38
No. A RowContainer will be created based on hive.mapjoin.bucket.cache.size, whose default value is 100.
See line 223 in MapJoinOperator.processOp():
    if (o == null) {
      int bucketSize = HiveConf.getIntVar(hconf, HiveConf.ConfVars.HIVEMAPJOINBUCKETCACHESIZE);
      res = getRowContainer(hconf, (byte) tag, order[tag], bucketSize);
      res.add(value);
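One way to picture why a cache-size setting need not mean data loss is a buffer that keeps at most a fixed number of rows in memory and moves full blocks aside instead of discarding them. The sketch below is only an illustration under that assumption; it is not Hive's RowContainer, the class SpillingRowBuffer and its methods are hypothetical, and a plain in-memory list stands in for whatever secondary storage a real implementation would use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: NOT Hive's RowContainer.
final class SpillingRowBuffer {
    private final int blockSize;                                   // max rows held in memory
    private final List<String> inMemory = new ArrayList<>();       // current block
    private final List<List<String>> spilled = new ArrayList<>();  // stand-in for spilled blocks

    SpillingRowBuffer(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    // Adds a row; when the in-memory block is full it is moved aside, not dropped.
    void add(String row) {
        if (inMemory.size() == blockSize) {
            spilled.add(new ArrayList<>(inMemory));
            inMemory.clear();
        }
        inMemory.add(row);
    }

    // Total rows retained, across spilled blocks and the in-memory block.
    int size() {
        return spilled.size() * blockSize + inMemory.size();
    }

    // Rows currently held in the in-memory block only.
    int inMemoryCount() {
        return inMemory.size();
    }

    public static void main(String[] args) {
        SpillingRowBuffer buf = new SpillingRowBuffer(100); // 100 mirrors the default quoted above
        for (int i = 0; i < 250; i++) {
            buf.add("row-" + i);
        }
        System.out.println(buf.size());          // 250: nothing was dropped
        System.out.println(buf.inMemoryCount()); // 50: only the last partial block is in memory
    }
}
```

In this model the size setting bounds memory use per block rather than the number of rows that can be joined. Whether a particular Hive version behaves this way should be checked against its source.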