HBase >> mail # user >> getSplits() in TableInputFormatBase


Re: getSplits() in TableInputFormatBase
3 tables? Are you counting root and meta also?
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
On Sun, Apr 11, 2010 at 1:57 AM, john smith <[EMAIL PROTECTED]> wrote:

> From the web interface...
>
>
> number of regions =5
> number of tables = 3
>
> Thanks
>
>
> On Sun, Apr 11, 2010 at 2:23 PM, Amandeep Khurana <[EMAIL PROTECTED]>
> wrote:
>
> > How many regions do you have?
> >
> >
> > Amandeep Khurana
> > Computer Science Graduate Student
> > University of California, Santa Cruz
> >
> >
> > On Sun, Apr 11, 2010 at 1:39 AM, john smith <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Amandeep ,
> > >
> > > Thanks for the explanation. What is the default value for the number of
> > > map tasks? Is it not equal to the number of regions?
> > >
> > > Right now I am running HBase in pseudo-distributed mode. If I set the number
> > > of map tasks to 100000 (some big number), I get numSplits = 1.
> > >
> > > If I don't set anything, numSplits = 2.
> > >
> > >
> > > Can you explain this?
> > >
> > > Thanks
> > > j.S
> > >
> > > On Sun, Apr 11, 2010 at 1:50 PM, Amandeep Khurana <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > If you set the number of map tasks higher than the number of regions
> > > > (I generally set it to 100000 or something like that), the number of
> > > > splits = number of regions. If you keep it lower, then it combines
> > > > regions in a single split.
> > > >
> > > >
> > > > Amandeep Khurana
> > > > Computer Science Graduate Student
> > > > University of California, Santa Cruz
> > > >
> > > >
> > > > On Sun, Apr 11, 2010 at 1:15 AM, john smith <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > Amandeep,
> > > > >
> > > > > > I guess that is not true. See the explanation in the docs:
> > > > >
> > > > >
> > > > > "Splits are created in number equal to the smallest between
> numSplits
> > > and
> > > > > the number of HRegion<
> > > > >
> > > >
> > >
> >
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html
> > > > > >s
> > > > > in the table. If the number of splits is smaller than the number of
> > > > > HRegion<
> > > > >
> > > >
> > >
> >
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html
> > > > > >s
> > > > > then splits are spanned across multiple
> > > > > HRegion<
> > > > >
> > > >
> > >
> >
> http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html
> > > > > >s
> > > > > and are grouped the most evenly possible. In the case splits are
> > uneven
> > > > the
> > > > > bigger splits are placed first in the InputSplit array.  "
> > > > >
> > > > >
> > > > > > Depending on whether numSplits is smaller or larger than the number of
> > > > > > regions, it chooses the real number of splits, and the same is done in
> > > > > > the code:
> > > > > >
> > > > > > // Code
> > > > > > int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;
> > > > > >
> > > > > > Here startKeys.length is the number of regions...
> > > > > >
> > > > > > Am I right?
> > > > >
> > > > > Thanks
> > > > > j.S
> > > > >
> > > > >
> > > > >
> > > > > > On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana <[EMAIL PROTECTED]>
> > > > > > wrote:
> > > > >
> > > > > > The number of splits is equal to the number of regions...
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Sun, Apr 11, 2010 at 12:54 AM, john smith <[EMAIL PROTECTED]>
> > > > > > > wrote:
> > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > In the method
> > > > > > > > "public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)"
> > > > > > > >
> > > > > > > > how is "numSplits" decided? I've seen different values of
> > > > > > > > numSplits for different MR jobs. Any reason for this?
> > > > > > >
> > > > > > > Also what if I ignore numsplits and always split at region
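
The split-count rule quoted in this thread (the real number of splits is the smaller of numSplits and the number of regions, regions are grouped as evenly as possible, and larger groups come first) can be sketched as below. This is a minimal illustration under stated assumptions, not the HBase source: the class name SplitGrouping and the method splitSizes are hypothetical, and the sketch only computes how many regions each split would cover instead of building real InputSplit objects.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of HBase.
class SplitGrouping {

    /**
     * Returns the number of regions covered by each split, following the
     * rule quoted from the docs: use min(numSplits, numRegions) splits,
     * spread regions as evenly as possible, larger splits first.
     */
    static int[] splitSizes(int numRegions, int numSplits) {
        if (numRegions <= 0 || numSplits <= 0) {
            return new int[0];
        }
        // Same comparison as the realNumSplits line quoted in the thread.
        int realNumSplits = Math.min(numSplits, numRegions);
        int base = numRegions / realNumSplits;      // regions every split gets
        int remainder = numRegions % realNumSplits; // leftover regions to hand out
        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            // The first `remainder` splits take one extra region,
            // so the bigger splits end up at the front of the array.
            sizes[i] = i < remainder ? base + 1 : base;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitSizes(5, 2)));      // [3, 2]
        System.out.println(Arrays.toString(splitSizes(5, 100000))); // [1, 1, 1, 1, 1]
        System.out.println(Arrays.toString(splitSizes(5, 1)));      // [5]
    }
}
```

With 5 regions, a very large numSplits yields one split per region, while a smaller numSplits makes each split span several regions.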