-
Region server crashes when using replication
Eran Kutner 2011-03-22, 17:39
Hi,
I'm trying to use replication between two HBase clusters and I'm encountering all kinds of crashes and weird behavior.

First, it seems that starting a region server when the peer ZKs are not available will cause the server to fail to start:

2011-03-22 08:31:56,647 INFO org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication is now started
2011-03-22 08:31:56,668 WARN org.apache.hadoop.hbase.zookeeper.ZKConfig: java.net.UnknownHostException: haddop2-zk3
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at java.net.InetAddress.getByName(InetAddress.java:970)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
        at java.lang.Thread.run(Thread.java:662)
2011-03-22 08:31:56,669 WARN org.apache.hadoop.hbase.zookeeper.ZKConfig: java.net.UnknownHostException: haddop2-zk2
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
        at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at java.net.InetAddress.getByName(InetAddress.java:970)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
        at org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
        at java.lang.Thread.run(Thread.java:662)
2011-03-22 08:31:56,669 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181 sessionTimeout=180000 watcher=connection to cluster: 1
2011-03-22 08:31:56,670 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed initialization
2011-03-22 08:31:56,670 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.net.UnknownHostException: haddop2-zk3
        at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
        at java.net.InetAddress.getAllByName(InetAddress.java:1084)
        at java.net.InetAddress.getAllByName(InetAddress.java:1020)
        at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
        at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
        at org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-22, 18:22
First issue: UnknownHostException is unforgiving, your machines need to be able to talk to haddop2-zk3 (is that a typo?) and it seems that at least that one can't. The reason the machine dies is that we usually try to "fail fast" in HBase.

Second issue: There's not enough information, all I see is a region server shutting down and the reason why is probably before that.

Third issue: https://issues.apache.org/jira/browse/HBASE-3664

Fourth issue: it's now 3 minutes in 0.90 for the timeout to happen.

J-D
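
Since the region server fails fast on an unresolvable peer ZooKeeper host, it can help to check name resolution before starting it. The following is only a rough sketch in Python, not anything HBase ships: the function names `quorum_hosts` and `unresolvable` are made up for illustration, the connect string format is taken from the log above, and IPv6 literals are not handled.

```python
import socket
from typing import Callable, Iterable, List


def quorum_hosts(connect_string: str) -> List[str]:
    """Split a 'host:port,host:port' connect string into host names."""
    hosts = []
    for entry in connect_string.split(","):
        entry = entry.strip()
        if entry:
            # Drop the ':port' suffix when one is present.
            hosts.append(entry.rsplit(":", 1)[0] if ":" in entry else entry)
    return hosts


def unresolvable(hosts: Iterable[str],
                 resolve: Callable[[str], object] = socket.gethostbyname) -> List[str]:
    """Return the hosts that the resolver cannot look up.

    The resolver is injectable (an assumption made for testability);
    by default the standard library lookup is used.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Calling `unresolvable(quorum_hosts("haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181"))` on the machine that runs the region server would list the names that machine cannot resolve; the result depends entirely on local DNS.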
-
Re: Region server crashes when using replication
Eran Kutner 2011-03-22, 18:59
Thanks, J-D.
As for the first issue, why does this behavior make sense? What happens when the connection between the two clusters fails? Will the region servers of the primary fail as well? or at least won't be able to start? Seems very radical.

Regarding the second issue, I didn't see anything else in the logs, it just seemed like it decided to shut down, but maybe I missed it. I will try to reproduce that and let you know if I succeed.

Regarding the timeout to detect a failed server, 3 minutes sounds like a very long time for a region server to be down. Obviously, during that time the data owned by that server is inaccessible. Is there a reason for this long timeout? Can it be configured?

-eran
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-22, 19:01
Inline.

J-D

On Tue, Mar 22, 2011 at 11:51 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> Thanks, J-D.
> As for the first issue, why does this behavior make sense? What happens when
> the connection between the two clusters fails? Will the region servers of the
> primary fail as well? or at least won't be able to start? Seems very
> radical.

The DNS entry should remain, so you won't get UnknownHostException but ConnectionRefused instead. But that's a different issue: HBASE-3130

> Regarding the second issue, I didn't see anything else in the logs, it just
> seemed like it decided to shut down, but maybe I missed it. I will try to
> reproduce that and let you know if I succeed.

That'd be nice :)

> Regarding the timeout to detect a failed server, 3 minutes sounds like a
> very long time for a region server to be down. Obviously, during that time
> the data owned by that server is inaccessible. Is there a reason for this
> long timeout? Can it be configured?

We set it that high for people that try to push too much data to clusters that are too small / badly configured and then end up with crazy garbage collections. Have fun reading this series of blog posts:
http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/

Please also see the book about this configuration:
http://hbase.apache.org/book.html#recommended_configurations
-
Re: Region server crashes when using replication
Eran Kutner 2011-03-22, 19:37
Actually, it will probably be connection timeout, not connection refused when there is no connection between the two clusters.

Is there a workaround I can implement now for HBASE-3664, can I write something in ZK so the server has an old entry to delete and is happy with it?

-eran
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-22, 19:46
You can apply the patch that I included there and that I also committed to the 0.90 branch.

J-D
-
Re: Region server crashes when using replication
Eran Kutner 2011-03-23, 12:04
I tried that, but still get the same result with the 0.90.2 build. The worst part is that a region server fails, then another one tries to take over and also fails, until the entire cluster is down. The fact that a replication failure such as this can cause a cascading failure of the entire cluster is very troubling. What is the design reason for shutting down the server for a replication error?

I also confirmed that the region server failures are not detected by the master even after 10 minutes. I'm not sure how to show this, but I see the servers go down and when I run status 'detailed' they are reported as "live" forever. I have zookeeper.session.timeout configured for 20000, which should cause it to be detected in 20 seconds.

-eran
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-23, 16:52
On Wed, Mar 23, 2011 at 4:54 AM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> I tried that, but still get the same result with the 0.90.2 build. The worst
> part is that region server fails, then another one tries to take over and
> also fails, until the entire cluster is down. The fact that a replication
> failure such as this can cause a cascading fail on the entire cluster is
> very troubling. What is the design reason for shutting down the server for a
> replication error?

Yeah looking at the log something is very odd, it's like something else came in and deleted the znode under the region server since previously you can see that it was using it:

2011-03-23 07:38:35,572 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication hadoop1-s02.farm-ny.gigya.com%3A60020.1300880313950 at 0
2011-03-23 07:38:35,590 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:0 and size: 124
2011-03-23 07:38:35,591 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #hadoop1-s02.farm-ny.gigya.com%3A60020.1300880313950 for position 124 in hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s02.farm-ny.gigya.com,60020,1300880313349/hadoop1-s02.farm-ny.gigya.com%3A60020.1300880313950

So it was there, then something deleted it.

Also, since you ask, here's a little story about replication. It was originally developed on the 0.89 branch and this is what we have run in production since September. The 0.90 branch has a totally reworked master and also totally reworked interface to ZK. Among other things, a way to abort the region server more easily was added. When porting over the replication code to the new ZK interface (replication relies heavily on it for coordination), a lot of abort calls were added as a way of telling the developer that something is wrong in the code and that it should be handled. Then, not much testing was done on replication and both 0.90.0 and 0.90.1 were released.

This is also why it's not enabled by default and why the documentation says: "This package is experimental quality software and is only meant to be a base for future developments." Meaning, use at your own risk.

If you still wanna go down that route, which I would really appreciate since I've been the only one maintaining replication since the beginning, then be prepared to get your hands dirty.

> I also confirmed that the region server failures are not detected by the
> master even after 10 minutes. I'm not sure how to show this, but I see the
> servers go down and when I run status 'detailed' they are reported as "live"
> forever. I have zookeeper.session.timeout configured for 20000, which should
> cause it to be detected in 20 seconds.

That's very very odd. Is the znode for that region server still in ZooKeeper? Does the master see anything about a znode getting deleted?

Thx,

J-D
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-23, 22:20
(ugh last email bounced 3 TIMES with spam score too high, something weird is going on so clearing all old text)

That really sounds like a different issue, and more in line with what I guessed earlier, that some other region server is deleting the znode under that region server. I would think of a configuration issue... how did you set up zookeeper for those two clusters? Do they use the same zk ensemble? If so, different root znode or the same?

Thx for sticking,

J-D
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-23, 22:28
On Wed, Mar 23, 2011 at 3:22 PM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> They are using two separate ensembles, 3 servers in each. I'm trying to
> create total independence for each cluster.

Can you find out who deleted the znode that the region server failed on? If it reported the status a few times before that, it means it existed whereas the other bug was that the znode never existed.

J-D
-
Re: Region server crashes when using replication
Eran Kutner 2011-03-24, 08:13
Here's what I found. I started with 2 RSs running in the cluster (#1 and #4).
This is how ZK looked at that point:

[zk: hadoop1-zk3:2181(CONNECTED) 25] ls /hbase/rs
[hadoop1-s01,60020,1300952215842, hadoop1-s04,60020,1300881354710]
[zk: hadoop1-zk3:2181(CONNECTED) 26] ls /hbase/replication/rs
[hadoop1-s01.farm-ny.gig.com,60020,1300952215842]

I then started RS #2; it seems that it is looking for the replication log file, which it can't find (see attached log). Immediately after that ZK looks like this:

[zk: hadoop1-zk3:2181(CONNECTED) 27] ls /hbase/replication/rs
[hadoop1-s02.farm-ny.gig.com,60020,1300953027434]

This is how .logs on HDFS looks at this time (.oldlogs directory is empty):

Found 5 items
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:36 /hbase/.logs/hadoop1-s01.farm-ny.gig.com,60020,1300952215842
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:50 /hbase/.logs/hadoop1-s02.farm-ny.gig.com,60020,1300953027434
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:36 /hbase/.logs/hadoop1-s03.farm-ny.gig.com,60020,1300952197026
drwxr-xr-x   - hbase supergroup          0 2011-03-24 03:56 /hbase/.logs/hadoop1-s04.farm-ny.gig.com,60020,1300881354710
drwxr-xr-x   - hbase supergroup          0 2011-03-23 16:57 /hbase/.logs/hadoop1-s05.farm-ny.gig.com,60020,1300913823878

After some time, while I'm writing this, I see that RS #1 has now crashed and ZK looks like this (see attached log):

[zk: hadoop1-zk3:2181(CONNECTED) 29] ls /hbase/replication/rs
[hadoop1-s02.farm-ny.gig.com,60020,1300953027434, hadoop1-s04.farm-ny.gig.com,60020,1300881354710]

One strange thing I'm noticing is that in the /hbase/rs node the servers are listed with their host name only, while in /hbase/replication/rs they are listed with their fully qualified DNS names. Is this intentional?

Note: I changed the domain name in this email because I think the mailing list's spam filter doesn't like it. The attached logs still show the full name.

-eran
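
The short-name versus fully-qualified-name discrepancy described above can be spotted mechanically by comparing the children of the two znodes. Below is a minimal Python sketch, assuming the child names have already been fetched (for example by copying the `ls` output) and that each has the `host,port,startcode` shape seen in this message. The helper names are hypothetical, and hosts given as IP addresses are not handled.

```python
from typing import Iterable, List, Tuple


def short_host(znode: str) -> str:
    """Return the unqualified host part of a 'host,port,startcode' name."""
    host = znode.split(",", 1)[0]
    return host.split(".", 1)[0]


def naming_mismatches(rs: Iterable[str],
                      replication_rs: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair entries that refer to the same server but are spelled differently.

    Two entries are treated as the same server when their short host name,
    port and start code all match.
    """
    def key(znode: str) -> Tuple[str, ...]:
        parts = znode.split(",")
        return (short_host(znode),) + tuple(parts[1:])

    by_key = {key(z): z for z in rs}
    mismatches = []
    for z in replication_rs:
        other = by_key.get(key(z))
        if other is not None and other != z:
            mismatches.append((other, z))
    return sorted(mismatches)
```

Fed the first two listings from this message, the function pairs `hadoop1-s01,60020,1300952215842` with `hadoop1-s01.farm-ny.gig.com,60020,1300952215842`, which is the mismatch noted above.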
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-24, 20:52
Ah yeah that's the issue, the mixup of FQDNs and hostnames. I wonder how it got into that state... but that explains why the environment looked so weird! Let me have a quick look at the code to figure why it's different and hopefully I can get you a patch just in time for 0.90.2

J-D

> One thing strange I'm noticing is that in the /hbase/rs node the
> servers are listed with their host name only while in
> /hbase/replication/rs they are listed with their fully qualified DNS
> names. Is this intentional?
>
> Note: I changed the domain name in this email because I think the
> mailing list's spam filter doesn't like it. The attached logs still
> show the full name.

So you guys are spammers? ;)
-
Re: Region server crashes when using replication
Eran Kutner 2011-03-24, 22:16
Now it doesn't like the email because it was in HTML format... As I said, not a very smart piece of software.

On Fri, Mar 25, 2011 at 00:07, Eran Kutner <[EMAIL PROTECTED]> wrote:
> You make it sound like it's a bad thing :)
> But seriously, SpamAssassin is really not the brightest anti spam software on the planet. You should check out what we're doing, we're actually in the same field as you guys, except our product is B2B.
>
> Thanks for looking into the bug.
>
> -eran
-
Re: Region server crashes when using replication
Jean-Daniel Cryans 2011-03-25, 00:02
Ok so this is the same old DNS issue...

This is the important message in the log:

Master passed us address to use. Was=hadoop1-s02:60020, Now=hadoop1-s02.farm-ny.not-a-spammer.com:60020

This means that when the RS tries to resolve itself it gets its hostname, but when the master resolves the RS it gets the FQDN. This is a bug in HBase that we rely on those strings as "true machine identification" but that's how it is at the moment. It happens to be that replication is set up later in the process so it uses the FQDN.

The only way you can fix it is to change your DNS settings. Here we resolve everything with their hostnames.

Hope that helps and sorry about all the trouble,

J-D
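
A quick way to see whether a machine is exposed to this kind of hostname versus FQDN split is to compare the two names it reports for itself. The Python sketch below is only a rough analogue: it uses the standard library's `socket.gethostname` and `socket.getfqdn`, which are not guaranteed to return what the JVM or the HBase master resolves, and the function name `identity_mismatch` is made up for illustration.

```python
import socket
from typing import Callable, Optional, Tuple


def identity_mismatch(
    local_name: Callable[[], str] = socket.gethostname,
    qualified_name: Callable[[], str] = socket.getfqdn,
) -> Optional[Tuple[str, str]]:
    """Return (short, qualified) when the two names differ, else None.

    Both lookups are injectable (an assumption made for testability);
    the defaults ask the local resolver.
    """
    short = local_name()
    full = qualified_name()
    if short != full:
        return (short, full)
    return None
```

A non-None result on a region server host suggests that different lookups may yield different strings for the same machine, which is the condition described in this message; a None result does not prove the master sees the same name.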
-
Re: Region server crashes when using replicationEran Kutner 2011-03-25, 19:26
Thanks, J-D, that managed to solve part of the problem. The servers have stopped crashing and the master now properly detects when an RS goes down. By the way, since the RS does detect this, it may be a good idea to stop the server on this event, as it is a significant configuration issue.

However, replication now just doesn't seem to work. I didn't change anything in the configuration, which had already managed to push 2 rows before crashing yesterday. I still see the peer properly configured in ZK and replication is enabled, but nothing is happening. All I see in the log of the RS that holds the table I'm writing into is:

2011-03-25 15:16:56,504 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10
2011-03-25 15:17:07,509 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10
2011-03-25 15:17:18,515 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10
2011-03-25 15:17:29,520 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10
2011-03-25 15:17:40,526 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10

Needless to say, nothing gets into the peer cluster.

-eran

On Fri, Mar 25, 2011 at 02:02, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> Ok so this is the same old DNS issue...
>
> This is the important message in the log:
>
> Master passed us address to use. Was=hadoop1-s02:60020,
> Now=hadoop1-s02.farm-ny.not-a-spammer.com:60020
>
> This means that when the RS tries to resolve itself it gets its
> hostname, but when the master resolves the RS it gets the FQDN. This
> is a bug in HBase that we rely on those strings as "true machine
> identification" but that's how it is at the moment. It happens to be
> that replication is set up later in the process so it uses the FQDN.
> The only way you can fix it is to change your DNS settings. Here we
> resolve everything with their hostnames.
>
> Hope that helps and sorry about all the trouble,
>
> J-D
>
>>> You make it sound like it's a bad thing :)
>>> But seriously, SpamAssassin is really not the brightest anti-spam software on the planet. You should check out what we're doing, we're actually in the same field as you guys, except our product is B2B.
>>>
>>> Thanks for looking into the bug.
>>>
>>> -eran
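The mismatch J-D describes (short hostname on one side, FQDN on the other) can be illustrated with a small Python sketch. This is not HBase code: `host_part`, `short_name` and `describe_mismatch` are hypothetical helpers written for this note, and the sketch assumes that two addresses refer to the same machine when their names agree up to the first dot.

```python
def host_part(address: str) -> str:
    """Drop a trailing ":port" from an address such as "hadoop1-s02:60020".

    Assumption: plain host names only; IPv6 literals are not handled.
    """
    return address.split(":", 1)[0]


def short_name(address: str) -> str:
    """Return the host name up to the first dot (the unqualified name)."""
    return host_part(address).split(".", 1)[0]


def describe_mismatch(self_view: str, master_view: str) -> str:
    """Compare the address a server uses for itself with the address
    another machine resolved for it."""
    if host_part(self_view) == host_part(master_view):
        return "consistent"
    if short_name(self_view) == short_name(master_view):
        return "same machine, different form (short name vs FQDN)"
    return "different machines"


# The two addresses from the "Master passed us address to use" log line.
was = "hadoop1-s02:60020"
now = "hadoop1-s02.farm-ny.not-a-spammer.com:60020"
print(describe_mismatch(was, now))
```

On a live node, the standard-library calls `socket.gethostname()` and `socket.getfqdn()` could supply the two inputs; what they return depends on the local DNS and /etc/hosts setup, which is exactly what needs to be made consistent.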
-
Re: Region server crashes when using replicationEran Kutner 2011-03-27, 10:17
Had more time to look into it and verified that data is indeed not replicated because the server doesn't see it in the log. So I tried restarting the RS, and sure enough, when the table (which has only one region) transitioned to another RS, replication started working (for new data only).

So I tried with another table and got the same thing: replication doesn't work and the log says "No log to process", but after restarting the RS and a table transition, replication started working for that table too. Is there something that gets initialized during a transition that could be missing before?

-eran
-
Re: Region server crashes when using replicationJean-Daniel Cryans 2011-03-27, 17:05
So the message "No log to process" means that nothing was assigned to replicate at all... I wonder if it's HBASE-3664 that did that... but you should be able to confirm by looking at the region server log when it starts; it should be adding its own HLog to the list to replicate.

Regarding the failover, it might be that the log is indeed registered in ZooKeeper but somehow isn't in the in-memory log structure of ReplicationSourceManager... again, the logs should tell.

Feel free to send me directly a big tar.gz with all those logs and I should be able to figure it out.

Thx!

J-D
-
Re: Region server crashes when using replicationJean-Daniel Cryans 2011-03-28, 18:29
Ah, turns out the issue was way simpler than I thought. One example:

2011-03-25 13:55:02,103 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replication is disabled, sleeping 1000 times 10
2011-03-25 13:55:07,762 INFO org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication is now started
2011-03-25 13:55:13,111 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10

(BTW, a pro tip when debugging an issue with HBase is to go back to its first occurrence.)

The issue is that the cluster had replication disabled, which is REALLY disruptive as it disables every replication feature, including adding new logs to replicate. This means that if the server starts with replication disabled, it won't even add a single log to replicate. Here's an example of when a new log was finally added after a long time of "No log to process":

2011-03-24 03:56:20,538 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10
2011-03-24 03:56:22,848 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-03-24 03:56:22,911 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2011-03-24 03:56:22,974 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/hadoop1-s04.farm-ny.etc

That was the previous day. On the 25th, when replication was enabled, no other log with data in it was rolled, so none was added to replicate.

Bottom line, disabling replication is a kill switch and should only be used with that functionality in mind. Starting the cluster with replication enabled should make it work right away for you.

Thx!

J-D
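The "go back to the first occurrence" tip can be sketched as a small log scan. The Python below is only an illustration under stated assumptions: the line layout (timestamp, level, logger, message) and the marker strings are taken from the log excerpts quoted in this thread, while `first_occurrences` and `started_disabled` are hypothetical helper names, not part of HBase.

```python
import re

# Timestamp, level, logger class, message: the layout of the quoted log lines.
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) ([\w.$]+): (.*)$"
)

# Message fragments copied from the excerpts in this thread.
MARKERS = (
    "Replication is disabled",
    "Replication is now started",
    "No log to process",
    "Roll /hbase/.logs/",
)


def first_occurrences(lines):
    """Map each marker to the timestamp of the first line that contains it."""
    seen = {}
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # stack traces and wrapped lines are skipped
        timestamp, _level, _logger, message = match.groups()
        for marker in MARKERS:
            if marker in message and marker not in seen:
                seen[marker] = timestamp
    return seen


def started_disabled(seen):
    """True if "disabled" was logged before any "now started" message.

    Timestamps in this format sort correctly as plain strings.
    """
    disabled = seen.get("Replication is disabled")
    started = seen.get("Replication is now started")
    return disabled is not None and (started is None or disabled < started)


SAMPLE = [
    "2011-03-25 13:55:02,103 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replication is disabled, sleeping 1000 times 10",
    "2011-03-25 13:55:07,762 INFO org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication is now started",
    "2011-03-25 13:55:13,111 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No log to process, sleeping 1000 times 10",
]

seen = first_occurrences(SAMPLE)
for marker, timestamp in seen.items():
    print(timestamp, marker)
print("started with replication disabled:", started_disabled(seen))
```

In the sample there is no "Roll /hbase/.logs/" entry after replication was started, which matches the explanation above: no log was rolled, so nothing was queued for replication.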
-
Re: Region server crashes when using replicationEran Kutner 2011-03-28, 19:59
Thanks J-D!

I disabled replication because at the time, every time I started it the entire cluster would shut itself down. Any reason why the servers will not create the HLog immediately when they receive the start_replication command? Is there a less destructive way to stop and start replication? Will removing the peer yield better results? (By the way, it would be nice if the shell had a "show_peers" command.)

The way I planned on using replication is a kind of "manual bi-directional" operation. The idea is to have each cluster in a different data center, one primary and one backup. Initially, replication is configured from primary to backup. If the primary data center goes down, we will switch traffic to the backup DC and reverse the direction of the replication so new writes will eventually sync back to the primary when it comes back online.

Right now I see two problems with this plan:
1) It seems that the servers crash if they can't talk to the peer ZK ensemble, which is really a huge problem.
2) I can't be certain when the HLogs will actually start being written unless I restart the entire secondary cluster after reversing the replication direction.

Am I right in my understanding of the current state of things?

Really appreciate your help!

-eran
-
Re: Region server crashes when using replicationJean-Daniel Cryans 2011-03-28, 21:43
Inline.

> Thanks J-D!
> I disabled replication because at the time, every time I started it
> the entire cluster would shut itself down.
> Any reason why the servers will not create the HLog immediately when
> they receive the start_replication command?

In the current code base, replication cannot ask anything of the part responsible for the WAL. In any case, start/stop replication wasn't built to do what you're trying to do; it's just a dirty kill switch.

> Is there a less destructive way to stop and start the replication?
> Will removing the peer yield better results? (By the way it would be
> nice if the shell had a "show_peers" command.)

There's an enable/disable command I have yet to implement :)
Adding/removing the peer should do the trick too. I agree we need to list peers (that could be a nice first contribution, wink wink).

> 1) it seems that the servers crash if they can't talk to the peer ZK ensemble, which is really a huge problem.

Like we previously discussed, this only happens when the region server starts, and it's also very easy to fix (just catch the right exception).

> 2) I can't be certain when will the HLogs actually start being written unless I restart the entire secondary cluster after reversing the replication direction.

That's when you use start/stop replication, which like I said isn't designed to do what you want to do. Adding/removing the peers will work correctly in this case.
-
Re: Region server crashes when using replicationEran Kutner 2011-03-29, 11:29
Thanks again J-D. I will avoid using stop_replication from now on.

As for the shell, JRuby (or even Java for that matter) is not really our strong suit here, but I'll try to give it a look when I have some time.

-eran