Re: full table scan
Himanshu Vashishtha 2011-06-06, 19:41
How big is each row? Are you using scanner caching? Are you just fetching all
the rows to the client, and then what?
300k rows is not big (it seems you have roughly one region, which would explain
the similar timing, since TableInputFormat creates one map task per region).
Add more data and MapReduce will pick up!
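A rough back-of-the-envelope sketch of the two points above, in plain Java with no HBase dependency. It assumes that a scan costs about one round trip to the region server per `caching` rows and that TableInputFormat creates one map task per region. The names ScanEstimate, roundTrips and mapTasks are made up for illustration, and 500 is just an example caching value, not a recommendation.

```java
// Back-of-the-envelope model only; it does not talk to HBase.
final class ScanEstimate {

    // Approximate number of scanner next() round trips: ceil(rows / caching).
    // Ignores the extra calls needed to open and close the scanner.
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    // Assumption: one map task per region, so one region means one mapper.
    static int mapTasks(int regions) {
        if (regions < 1) {
            throw new IllegalArgumentException("regions must be >= 1");
        }
        return regions;
    }

    public static void main(String[] args) {
        long rows = 300000L; // table size mentioned in the thread
        System.out.println("round trips, caching=1:   " + roundTrips(rows, 1));
        System.out.println("round trips, caching=500: " + roundTrips(rows, 500));
        System.out.println("map tasks, 1 region:      " + mapTasks(1));
    }
}
```

Under this model, 300000 rows cost about 300000 round trips with a caching of 1 and about 600 with a caching of 500, while a single region still yields a single map task either way. In a real client the caching value would normally be set with Scan.setCaching(int) before calling getScanner, or through hbase.client.scanner.caching in the configuration.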
On Mon, Jun 6, 2011 at 8:59 AM, Christopher Tarnas <[EMAIL PROTECTED]> wrote:
> How many regions does your table have? If all of the data is still in one
> region then you will be rate limited by how fast that single region can be
> read. 3 nodes is also pretty small; the more nodes you have the better (at
> least 5 for dev and test and 10+ for production, in my experience).
> Also, with only 4 servers you probably only need one zookeeper node; you
> will not be putting it under any serious load and you already have a SPOF
> in server1 (namenode, hbase master, etc).
> On Mon, Jun 6, 2011 at 3:48 AM, Andreas Reiter <[EMAIL PROTECTED]> wrote:
> > hello everybody
> > i'm trying to scan my hbase table for reporting purposes
> > the cluster has 4 servers:
> > - server1: namenode, secondary namenode, jobtracker, hbase master,
> > zookeeper1
> > - server2: datanode, tasktracker, hbase regionserver, zookeeper2
> > - server3: datanode, tasktracker, hbase regionserver, zookeeper3
> > - server4: datanode, tasktracker, hbase regionserver
> > everything seems to work properly
> > versions:
> > - hadoop-0.20.2-CDH3B4
> > - hbase-0.90.1-CDH3B4
> > - zookeeper-3.3.2-CDH3B4
> > at the moment our hbase table has 300000 entries
> > if i do a table scan over the hbase api (at the moment without a filter)
> > ResultScanner scanner = table.getScanner(...);
> > it takes about 60 seconds to process, which is actually okay, because all
> > records are processed by only one thread sequentially
> > BUT it takes approximately the same time if i do the scan in a MapReduce
> > job using TableInputFormat
> > i'm definitely doing something wrong, because the processing time goes
> > up in direct proportion to the number of rows.
> > in my understanding, the big advantage of hadoop/hbase is that huge
> > numbers of entries can be processed in parallel and very fast
> > 300k entries are not much; we are expecting this number to be added hourly
> > to our cluster, but the processing time keeps increasing, which is not
> > acceptable
> > does anyone have an idea what i'm doing wrong?
> > best regards
> > andre