-
getSplits() in TableInputFormatBase
john smith 2010-04-11, 07:54
Hi,
In the method "public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)",
how is "numSplits" decided? I've seen different values of numSplits for different MR jobs. Is there a reason for this?
Also, what if I ignore numSplits and always split at region boundaries? I would guess that splitting at region boundaries makes more sense and somewhat improves data locality.
Any comments on the above?
Thanks,
j.S
-
Re: getSplits() in TableInputFormatBase
Amandeep Khurana 2010-04-11, 08:03
The number of splits is equal to the number of regions...
-
Re: getSplits() in TableInputFormatBase
john smith 2010-04-11, 08:15
Amandeep,
I don't think that is true. See the explanation in the docs:
"Splits are created in number equal to the smallest between numSplits and the number of HRegions (http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html) in the table. If the number of splits is smaller than the number of HRegions then splits are spanned across multiple HRegions and are grouped the most evenly possible. In the case splits are uneven the bigger splits are placed first in the InputSplit array."
Depending on whether numSplits is smaller or larger than the number of regions, it chooses the real number of splits, and the same is done in the code:
    // Code
    int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;
Here startKeys.length is the number of regions.
Am I right?
Thanks,
j.S
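The ternary quoted above amounts to taking the minimum of the two values. The following is a minimal, standalone sketch of that rule only, not the HBase source; the class name SplitCount and the parameter name regionCount (standing in for startKeys.length) are made up for illustration.

```java
// Illustrative sketch only; SplitCount and regionCount are hypothetical names.
final class SplitCount {

    // numSplits:   the hint passed to getSplits() by the framework
    // regionCount: stands in for startKeys.length, the number of regions in the table
    static int realNumSplits(int numSplits, int regionCount) {
        // Same result as: numSplits > regionCount ? regionCount : numSplits
        return Math.min(numSplits, regionCount);
    }

    public static void main(String[] args) {
        // A very large hint is capped at the number of regions.
        System.out.println(realNumSplits(100000, 1)); // prints 1
        // A small hint wins when the table has more regions than that.
        System.out.println(realNumSplits(2, 5));      // prints 2
    }
}
```

Under this rule, a table with a single region can never yield more than one split, whatever hint is passed in.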
-
Re: getSplits() in TableInputFormatBase
Amandeep Khurana 2010-04-11, 08:20
If you set the number of map tasks to a value higher than the number of regions (I generally set it to 100000 or something like that), the number of splits equals the number of regions. If you keep it lower, several regions are combined into a single split.

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
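To make the "combines regions in a single split" case concrete, here is a small sketch of one way regions could be distributed over splits so that the groups are as even as possible and the larger groups come first, as the quoted Javadoc describes. It is an assumption-laden illustration rather than the actual TableInputFormatBase code; the class RegionGrouping and the method splitSizes are hypothetical names.

```java
import java.util.Arrays;

// Illustrative sketch only; RegionGrouping and splitSizes are hypothetical names.
final class RegionGrouping {

    // Returns how many regions each split would cover, given the numSplits hint
    // and the number of regions in the table.
    static int[] splitSizes(int numSplits, int regionCount) {
        int realNumSplits = Math.min(numSplits, regionCount);
        if (realNumSplits <= 0) {
            return new int[0]; // no regions or a non-positive hint: nothing to split
        }
        int base = regionCount / realNumSplits;      // regions every split gets
        int remainder = regionCount % realNumSplits; // leftover regions to hand out
        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            // The first `remainder` splits take one extra region,
            // so the bigger splits end up first in the array.
            sizes[i] = base + (i < remainder ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hint lower than the region count: regions are grouped.
        System.out.println(Arrays.toString(splitSizes(2, 5)));      // prints [3, 2]
        // Hint higher than the region count: one region per split.
        System.out.println(Arrays.toString(splitSizes(100000, 3))); // prints [1, 1, 1]
    }
}
```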
-
Re: getSplits() in TableInputFormatBase
john smith 2010-04-11, 08:39
Amandeep,
Thanks for the explanation. What is the default value for the number of maps? Is it not equal to the number of regions?
Right now I am running HBase in pseudo-distributed mode. If I set the number of map tasks to 100000 (some big number), I get numSplits = 1.
If I don't set anything, numSplits = 2.
Can you explain this?
Thanks,
j.S
-
Re: getSplits() in TableInputFormatBase
Amandeep Khurana 2010-04-11, 08:53
How many regions do you have?
-
Re: getSplits() in TableInputFormatBase
john smith 2010-04-11, 08:57
From the web interface:
number of regions = 5
number of tables = 3
Thanks
-
Re: getSplits() in TableInputFormatBase
Amandeep Khurana 2010-04-11, 09:10
3 tables? Are you counting root and meta also?
-
Re: getSplits() in TableInputFormatBase
john smith 2010-04-11, 09:23
Amandeep,
No. I have 3 tables: A, B, and C. Does the number of regions (5) include one region each from META and ROOT?
I should get numSplits = 3 (the total number of user regions), but I am getting 1.
Thanks
-
Re: getSplits() in TableInputFormatBase
Amandeep Khurana 2010-04-11, 09:27
You have 1 region per table, and that's why you are getting 1 split when you scan any of those tables. Moreover, the number-of-map-tasks configuration is ignored when you are running in pseudo-distributed mode, since the job tracker is local.