Andreas Reiter 2011-06-06, 08:48
hello everybody
i'm trying to scan my hbase table for reporting purposes

the cluster has 4 servers:
- server1: namenode, secondary namenode, jobtracker, hbase master, zookeeper1
- server2: datanode, tasktracker, hbase regionserver, zookeeper2
- server3: datanode, tasktracker, hbase regionserver, zookeeper3
- server4: datanode, tasktracker, hbase regionserver
everything seems to work properly

versions:
- hadoop-0.20.2-CDH3B4
- hbase-0.90.1-CDH3B4
- zookeeper-3.3.2-CDH3B4

at the moment our hbase table has 300000 entries
if i do a table scan over the hbase api (at the moment without a filter)
ResultScanner scanner = table.getScanner(...);
it takes about 60 seconds to process, which is actually okay, because all records are processed sequentially by only one thread
BUT it takes approximately the same time if i do the scan in a Map&Reduce job using TableInputFormat

i'm definitely doing something wrong, because the processing time goes up in direct proportion to the number of rows. in my understanding, the big advantage of hadoop/hbase is that huge numbers of entries can be processed in parallel and very fast

300k entries are not much, we are expecting this number to be added hourly to our cluster, but the processing time keeps increasing, which is not acceptable

does anyone have an idea what i'm doing wrong?

best regards
andre
Joey Echeverria 2011-06-06, 13:10
How many regions does your table have?
-- Joseph Echeverria Cloudera, Inc. 443.305.9434
Christopher Tarnas 2011-06-06, 14:59
How many regions does your table have? If all of the data is still in one region then you will be rate limited by how fast that single region can be read. 3 nodes is also pretty small, the more nodes you have the better (at least 5 for dev and test and 10+ for production has been my experience).
Also, with only 4 servers you probably only need one zookeeper node; you will not be putting it under any serious load and you already have a SPOF on server1 (namenode, hbase master, etc).
-chris
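The single-region point can be made concrete with a little arithmetic. The sketch below assumes that TableInputFormat produces one map task per region, and it uses a hypothetical figure of 2 map slots per tasktracker, which is not stated anywhere in the thread. Under those assumptions a one-region table can never use more than one map task, however many nodes there are.

```java
// Sketch only: how the region count bounds map-side parallelism for a table scan job.
// Assumptions: one map task per region; a fixed number of map slots per tasktracker.
final class RegionParallelism {
    /** Map tasks that can run at the same time. */
    static int concurrentMaps(int regions, int taskTrackers, int slotsPerTracker) {
        return Math.min(regions, taskTrackers * slotsPerTracker);
    }

    /** Sequential "waves" needed to run one map task per region. */
    static int waves(int regions, int taskTrackers, int slotsPerTracker) {
        int slots = taskTrackers * slotsPerTracker;
        if (slots <= 0) {
            throw new IllegalArgumentException("need at least one map slot");
        }
        return (regions + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        int trackers = 3;        // server2..server4 in the thread
        int slotsPerTracker = 2; // hypothetical value, not taken from the thread
        for (int regions : new int[] {1, 3}) {
            System.out.println(regions + " region(s): "
                + concurrentMaps(regions, trackers, slotsPerTracker)
                + " concurrent map task(s), "
                + waves(regions, trackers, slotsPerTracker) + " wave(s)");
        }
    }
}
```

This only models scheduling. It says nothing about how fast each individual map task reads its region.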
Himanshu Vashishtha 2011-06-06, 19:41
Also, how big is each row? Are you using scanner caching? Are you just fetching all the rows to the client, and then what?
300k is not big (it seems you have only about one region, which could explain the similar timing). Add more data and mapreduce will pick up!
Thanks, Himanshu
Andre Reiter 2011-06-06, 21:27
good question... i have no idea...
i did not explicitly define the number of regions for the table. how can i find out how many regions my table has? how many regions should the table have? how do i change the number of regions?
best regards andre
> ----- Original Message -----
> From: Joey Echeverria
> Sent: Mon Jun 06 2011 15:10:29 GMT+0200 (CET)
> Subject: Re: full table scan
>
> How many regions does your table have?
Doug Meil 2011-06-06, 21:30
Check the web console.
Andre Reiter 2011-06-06, 22:07
> Check the web console.
ah, ok thanks! at port 60010 on the hbase master i actually found a web interface. there was only one region; i played a bit with it and executed the "Split" function twice. now i have three regions, one on each hbase region server, but still, the processing time did not change... i measured the same times as with only one region...
best regards andre
I think row counter would help you figure out the number of rows in each region. Refer to the following email thread, especially Stack's answer on Apr 1: row_counter map reduce job & 0.90.1
Andre Reiter 2011-06-07, 08:08
now i found out that there are three regions, each on a different region server (server2, server3, server4). the processing time is still >=60sec, which is not very impressive...
what can i do to speed up the table scan?
best regards
andre
See http://hbase.apache.org/book/performance.html
St.Ack
Andre Reiter 2011-06-08, 04:43
cool, just one change

scan.setCaching(1000);

reduced the processing time of my MR job from 60sec to 10sec! nice :-)

PS: now looking for other optimizations...

Stack wrote:
> See http://hbase.apache.org/book/performance.html
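A rough way to see why this one setting matters is to count client round trips. The sketch below assumes each scanner fetch returns up to `caching` rows in a single RPC, and that the scan previously fetched one row per RPC. That one-row value is believed to be the client default in HBase of this era, but treat it as an assumption. The class and method names are made up for illustration.

```java
// Sketch only: models client-to-regionserver round trips for a full scan.
// Assumption: each scanner fetch returns up to `caching` rows in one RPC.
final class ScanRoundTrips {
    /** Number of fetch RPCs needed to read `rows` rows at `caching` rows per RPC. */
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 300_000L; // table size reported at the start of the thread
        System.out.println("caching=1:    " + roundTrips(rows, 1) + " round trips");
        System.out.println("caching=1000: " + roundTrips(rows, 1000) + " round trips");
    }
}
```

Fewer round trips means less per-call network latency, at the price of more memory held per fetch on both the client and the region server.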
Jean-Daniel Cryans 2011-06-10, 18:46
You expect a MapReduce job to be faster than a Scan on small data; your expectation is wrong. There's a minimal cost to every MR job, which is a few seconds, and you can't get around it. What other people have been trying to tell you is that you don't have enough data to benefit from the parallel execution advantages of Hadoop and HBase.
J-D
Andre Reiter 2011-06-11, 08:36
Jean-Daniel Cryans wrote: > You expect a MapReduce job to be faster than a Scan on small data, > your expectation is wrong.
never expected a MR job to be faster for every context
> There's a minimal cost to every MR job, which is of a few seconds, and > you can't go around it.
for sure there is an overhead for MR job, and a few seconds are OK, but not a whole minute...
so what time can be expected for processing a full scan of e.g. 1.000.000.000 rows in an hbase cluster with e.g. 3 region servers?
i'm just wondering if it's worth running the full scan only once a day and persisting the results. i hoped to be able to process it on demand, but if it takes too much time, it's not acceptable
andre
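As a very rough lower bound on that question, the only measurement available at this point in the thread is 300000 rows in about 10 seconds after enabling scanner caching. The sketch below extrapolates linearly from it, which is a strong assumption: it ignores job startup cost, row size, region count, and any gain from adding servers. The class and method names are hypothetical.

```java
// Sketch only: naive linear extrapolation of scan time from one measurement.
// Assumption: throughput (rows per second) stays constant as the table grows.
final class ScanTimeEstimate {
    /** Estimated seconds to scan targetRows, given a measured run. */
    static double secondsFor(long targetRows, long measuredRows, double measuredSeconds) {
        if (measuredRows <= 0 || measuredSeconds <= 0) {
            throw new IllegalArgumentException("measurement must be positive");
        }
        double rowsPerSecond = measuredRows / measuredSeconds;
        return targetRows / rowsPerSecond;
    }

    public static void main(String[] args) {
        // 300000 rows in ~10 s is the figure reported earlier in the thread.
        double seconds = secondsFor(1_000_000_000L, 300_000L, 10.0);
        System.out.println("estimated seconds: " + seconds);
        System.out.println("estimated hours:   " + seconds / 3600.0);
    }
}
```

On these assumptions the answer is on the order of hours rather than seconds, which is why the suggestion of a daily batch run with persisted results is worth weighing against on-demand scans.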
On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter <[EMAIL PROTECTED]> wrote: > so what time can be expected for processing a full scan of i.e. > 1.000.000.000 rows in an hbase cluster with i.e. 3 region servers? >
I don't think three servers and 1M rows (only) are enough data and resources to compare and contrast. Multiply the data by 100 and the servers by three or four (IMO).
St.Ack
Ted Dunning 2011-06-12, 09:31
He said 10^9. Easy to misread.
Thanks Ted. I misread
Andre Reiter 2011-06-21, 05:13
sorry guys, still the same problem... my MR jobs are not running very fast...
the job org.apache.hadoop.hbase.mapreduce.RowCounter took 13 minutes to complete, while we do not have many rows, just 3223543. at the moment we have 3 region servers, and the table is split over 13 regions on those 3 servers
i just cannot believe it's that slow...
what is going wrong?
Sounds like you are doing about 5k rows/second per server.
What size rows? How many column families? What kind of hardware?
St.Ack
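The throughput can be worked out directly from the numbers in the RowCounter report (3223543 rows, 13 minutes, 3 region servers). The sketch below does only that arithmetic. The even split across servers is an assumption, and the class and method names are hypothetical.

```java
// Sketch only: throughput arithmetic for the RowCounter run reported in the thread.
// Assumption: work is spread evenly across the region servers.
final class RowCounterThroughput {
    /** Aggregate rows per second for the whole job. */
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    /** Rows per second handled by each server, assuming an even split. */
    static double rowsPerSecondPerServer(long rows, double seconds, int servers) {
        return rowsPerSecond(rows, seconds) / servers;
    }

    public static void main(String[] args) {
        long rows = 3_223_543L;      // row count from the thread
        double seconds = 13 * 60.0;  // 13 minutes
        int servers = 3;             // region servers
        System.out.println("aggregate rows/s:  " + rowsPerSecond(rows, seconds));
        System.out.println("per-server rows/s: "
            + rowsPerSecondPerServer(rows, seconds, servers));
    }
}
```

By this arithmetic the aggregate rate is a little over 4k rows per second, so the per-server rate under an even split is lower still.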
Andre Reiter 2011-06-21, 07:02
Hi Stack,
thanks a lot for the reply. each row is about 2k on average, there are only 2 families

hardware:
- CPU: 2x AMD Opteron(tm) Processor 250 (2.4GHz)
- disk: 500 GB, software raid1 (2x WDC WD5000AAKB-00H8A0, ATA DISK drive)
- memory: 2 GB
- network: 1 Gbps Ethernet
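These figures allow a back-of-the-envelope estimate of how much data the scan has to move. The sketch below takes "2k" to mean 2048 bytes per row, which is an assumption, and it ignores HBase's per-cell key overhead and HDFS replication, so the real volume on disk is likely larger. The class and method names are hypothetical.

```java
// Sketch only: rough data volume for the table described in the thread.
// Assumptions: "2k" means 2048 bytes per row; storage overhead is ignored.
final class ScanDataVolume {
    /** Total payload bytes for the given row count and average row size. */
    static long totalBytes(long rows, int bytesPerRow) {
        return rows * bytesPerRow;
    }

    /** Average bytes per second if totalBytes are read in the given time. */
    static double bytesPerSecond(long totalBytes, double seconds) {
        return totalBytes / seconds;
    }

    public static void main(String[] args) {
        long total = totalBytes(3_223_543L, 2048);      // rows from the RowCounter run
        long clusterRam = 3L * 2L * 1024 * 1024 * 1024; // 3 region servers x 2 GB each
        System.out.println("payload bytes:       " + total);
        System.out.println("combined RAM bytes:  " + clusterRam);
        System.out.println("bytes/s over 13 min: " + bytesPerSecond(total, 13 * 60.0));
    }
}
```

Under these assumptions the payload alone is slightly more than the combined physical memory of the three region servers, while the average read rate is only a few megabytes per second, far below what a 1 Gbps link can carry.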
Andre:
As per Ted in the other thread, because you have only 2GB, are you sure that you are not swapping? Swapping will cause everything to slow down.
St.Ack