Yi Liang
2011-12-27, 10:53
Lars H
2011-12-27, 16:31
Ramkrishna S Vasudevan
2011-12-28, 02:20
Gaojinchao
2011-12-28, 08:21
Yi Liang
2011-12-29, 01:54
Yi Liang
2011-12-29, 04:26
Yi Liang
2011-12-29, 04:28
-
Read speed down after long running
Yi Liang
2011-12-27, 10:53

Hi all,

We're running HBase 0.90.3 for a read-intensive application.

We find that after running for a long time (2 weeks, 1 month, or longer), read speed becomes much lower.

For example, a get_rows operation over Thrift that fetches 20 rows (about 4 KB per row) can take more than 2 seconds, sometimes even more than 5 seconds. When this happens, cpu_wio stays at about 10.

But if we restart HBase (only the master and region servers) with stop-hbase.sh and start-hbase.sh, read speed returns to normal immediately, which is <200 ms for every get_rows operation, and cpu_wio drops to about 2.

When the problem appears, there are no exceptions in the logs and no flushes or compactions. Nothing looks abnormal except a few occasional warnings like the one below:

2011-12-27 15:50:20,307 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 52 on 60020 took 1546 ms appending an edit to hlog; editcount=1, len~=9.8k

Our cluster has 10 region servers, each with a 25 GB heap, 64% of which is used for cache. Some M/R jobs run continuously in another cluster to feed data into this HBase cluster. Every night we do a flush and a major compaction. Usually there are no flushes or compactions in the daytime.

Could anybody explain why read speed drops after a long run, and why it returns to normal immediately after restarting HBase?

Any advice will be highly appreciated.

Thanks,
Yi
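One way to check whether the slow HLog appends line up with the slow reads is to pull every such warning out of the region server logs and see how often and how badly they occur. The following is a minimal sketch, assuming the warnings always follow the exact format of the line quoted in the post; the function names `parse_slow_appends` and `summarize` are illustrative and not part of HBase.

```python
import re

# Matches the WARN line format quoted in the post. The format is assumed
# to be stable across the log; lines that differ are silently skipped.
SLOW_APPEND = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN "
    r"org\.apache\.hadoop\.hbase\.regionserver\.wal\.HLog: "
    r"IPC Server handler (?P<handler>\d+) on (?P<port>\d+) "
    r"took (?P<ms>\d+) ms appending an edit to hlog; "
    r"editcount=(?P<edits>\d+), len~=(?P<len>\S+)"
)


def parse_slow_appends(lines):
    """Return (timestamp, handler, milliseconds) for each matching line."""
    hits = []
    for line in lines:
        m = SLOW_APPEND.match(line.strip())
        if m:
            hits.append((m.group("ts"), int(m.group("handler")), int(m.group("ms"))))
    return hits


def summarize(hits):
    """Count the slow appends and report the worst and mean latency in ms."""
    if not hits:
        return {"count": 0, "max_ms": 0, "mean_ms": 0.0}
    ms = [h[2] for h in hits]
    return {"count": len(ms), "max_ms": max(ms), "mean_ms": sum(ms) / len(ms)}
```

Feeding it the lines of a region server log (for example `summarize(parse_slow_appends(open(path)))`, with `path` pointing at whatever log file is in use) gives a rough count per server that can be compared between a freshly restarted cluster and one that has been up for weeks.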
-
Re: Read speed down after long running
Lars H
2011-12-27, 16:31

When you restart HBase, are you also restarting the client process?
Are you using HBaseAdmin.tableExists?
If so, you might be running into HBASE-5073.

-- Lars
-
RE: Read speed down after long running
Ramkrishna S Vasudevan
2011-12-28, 02:20

As Lars mentioned, admin APIs like flush and compact will also slow down the client.
When you restart the HBase cluster, are the clients restarted as well?

Regards,
Ram
-
Re: Read speed down after long running
Gaojinchao
2011-12-28, 08:21

I think you need to check the thread dumps (client and RS) and the resources (memory, I/O, and network) of your cluster.
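When comparing thread dumps taken before and after the slowdown, a quick tally of thread states can show whether handlers are piling up in BLOCKED or WAITING. Below is a minimal sketch, assuming the dump is the plain-text output of jstack, where each thread starts with a quoted name and is followed by a `java.lang.Thread.State:` line; `thread_states` and `summarize_dump` are hypothetical helper names.

```python
import re
from collections import Counter

# A thread entry starts with its name in double quotes at column 0.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# The state line looks like: java.lang.Thread.State: WAITING (parking)
THREAD_STATE = re.compile(r"java\.lang\.Thread\.State: (?P<state>[A-Z_]+)")


def thread_states(dump_text):
    """Yield (thread_name, state) pairs from the text of a jstack dump."""
    name = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            name = header.group("name")
            continue
        state = THREAD_STATE.search(line)
        if state and name is not None:
            yield name, state.group("state")
            name = None  # at most one state line per thread entry


def summarize_dump(dump_text):
    """Return a Counter of thread states and the names of BLOCKED threads."""
    pairs = list(thread_states(dump_text))
    counts = Counter(state for _, state in pairs)
    blocked = [name for name, state in pairs if state == "BLOCKED"]
    return counts, blocked
```

Running this over dumps from a slow region server and from a freshly restarted one, then comparing the two counters, narrows down which threads deserve a closer look. Threads without a state line (such as JVM-internal threads) are simply not counted.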
-
Re: Read speed down after long running
Yi Liang
2011-12-29, 01:54

Lars, Ram:

I don't restart the client processes (in my case, they're Thrift servers); I only restart the master and the region servers. Do you mean I should also restart the Thrift servers?

I'm now checking the Thrift server code, and it seems that it does use HBaseAdmin.tableExists in places like createTable() and deleteTable().

Jinchao:

I don't see any clues when checking the region servers with jstack. Which states or threads should I check more carefully? When the problem occurs, we see more I/O than usual; memory and network look OK.

Thank you for your suggestions!
Yi
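To put a number on the extra I/O seen during the slowdown, the iowait share can be sampled directly on a region server host. This is a minimal sketch under two assumptions: the hosts run Linux, and cpu_wio reflects the kernel's iowait share as reported in /proc/stat (the thread does not say which monitoring tool produces it). The function names are illustrative.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat into a list of ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Columns: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user/nice, so only the first
    # eight columns are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

In practice, `before` and `after` would be the first line of /proc/stat read a few seconds apart. Logging this value alongside get_rows latency over several weeks would show whether iowait climbs gradually or jumps at a particular moment.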
-
Re: Read speed down after long running
Yi Liang
2011-12-29, 04:26

Sorry, I forgot there's another kind of client process: the Java MapReduce jobs that write data. I don't restart them either. They're usually short-lived.

I think either the M/R jobs or thrift servers would execute the HBaseAdmin.tableExists, because we use them only to do get or put operations. The M/R jobs are used to put and get data, and the Thrift servers are used to get rows of data. All tables were created once and have never been altered or deleted since.
-
Re: Read speed down after long running
Yi Liang
2011-12-29, 04:28

Excuse me for my poor English...
I meant that neither the M/R jobs nor the Thrift servers would execute HBaseAdmin.tableExists...