|
lars hofhansl
2012-11-04, 00:37
rajesh babu chintaguntla
2012-11-04, 02:10
Matt Corgan
2012-11-04, 02:27
ramkrishna vasudevan
2012-11-04, 17:12
Ted Yu
2012-11-04, 22:07
Matt Corgan
2012-11-05, 01:16
Matt Corgan
2012-11-05, 01:20
Jean-Daniel Cryans
2012-11-05, 17:52
Matt Corgan
2012-11-05, 22:02
Jean-Daniel Cryans
2012-11-05, 22:15
Matt Corgan
2012-11-06, 01:33
Ted Yu
2012-11-06, 03:07
Matt Corgan
2012-11-06, 05:10
lars hofhansl
2012-11-06, 05:17
Matt Corgan
2012-11-06, 05:28
lars hofhansl
2012-11-06, 05:45
ramkrishna vasudevan
2012-11-06, 05:59
ramkrishna vasudevan
2012-11-06, 06:30
Matt Corgan
2012-11-06, 08:12
ramkrishna vasudevan
2012-11-06, 10:07
ramkrishna vasudevan
2012-11-06, 11:29
lars hofhansl
2012-11-06, 15:13
Matt Corgan
2012-11-06, 23:36
|
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2lars hofhansl 2012-11-04, 00:37
CC'ing dev list...
Is anybody aware of any changes that went in recently that could cause this? I looked around a bit, but could not find anything obvious. -- Lars ________________________________ From: Matt Corgan <[EMAIL PROTECTED]> To: user <[EMAIL PROTECTED]> Sent: Saturday, November 3, 2012 5:27 PM Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 I think the cluster is ok without running hbck, as restarting the regionserver process stops the warnings and everything looks ok otherwise. here's the regionserver right after the split happens: ------------------------ 2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832953 Attempting to transition node bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* *EGION_SPLITTING to RS_ZK_REGION_SPLIT* 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832953 Successfully transitioned node bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ REGION_SPLITTING to RS_ZK_REGION_SPLIT 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for bc62a8a72124a4ba3f6b9f302587903c 2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832953 Attempting to transition node bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R EGION_SPLIT to RS_ZK_REGION_SPLIT 2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832953 Successfully transitioned node bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ REGION_SPLIT to RS_ZK_REGION_SPLIT ----------------------------- The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" continues for 15 or so hours and finally settles without manual intervention with these regionserver log messages: ----------------------- 2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832953 Attempting to transition node * bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT 2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832953 Successfully transitioned node * bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT 2012-11-02 13:55:00,916 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META updated, and report to master. Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. *bc62a8a72124a4ba3f6b9f302587903c*., new regions: ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. Split took 15hrs, 9mins, 33sec 2012-11-02 13:55:00,945 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META updated, and report to master. Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., new regions: ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3., ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.* bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec ---------------------- And the master finally logs this: ----------------------- 2012-11-02 13:55:00,783 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region bc62a8a72124a4ba3f6b9f302587903c not found on server HadoopNode162.hotpads.srv,60020,1351788248279; failed processing 2012-11-02 13:55:00,783 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region bc62a8a72124a4ba3f6b9f302587903c from server HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore, probably already processed its split ------------------------ I can't find any evidence that the region left the node during that time. I'll have to catch it in action next time and see what the region is up to during the problem period. What does it mean that it successfully transitioned from SPLIT to SPLIT? Is that a valid transition? Matt On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2rajesh babu chintaguntla 2012-11-04, 02:10
Hi Matt,
can you paste some more master logs of region bc62a8a72124a4ba3f6b9f30258790 before split. I think Its not problem with splitting. We are getting LOG.warn("Region " + encodedName + " not found on server " + serverName + "; failed processing"); this log means no entry in servers map(not fully assigned). Set<HRegionInfo> hris = this.servers.get(sn); HRegionInfo foundHri = null; for (HRegionInfo hri: hris) { if (hri.getEncodedName().equals(encodedName)) { foundHri = hri; break; } } return foundHri; On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > CC'ing dev list... > > Is anybody aware of any changes that went in recently that could cause > this? > I looked around a bit, but could not find anything obvious. > > -- Lars > > > > ________________________________ > From: Matt Corgan <[EMAIL PROTECTED]> > To: user <[EMAIL PROTECTED]> > Sent: Saturday, November 3, 2012 5:27 PM > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > I think the cluster is ok without running hbck, as restarting the > regionserver process stops the warnings and everything looks ok otherwise. > > here's the regionserver right after the split happens: > ------------------------ > 2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x13ab46479832953 Attempting to transition node > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x13ab46479832953 Successfully transitioned node > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > REGION_SPLITTING to RS_ZK_REGION_SPLIT > 2012-11-01 22:45:28,730 DEBUG > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the > master to process the split for bc62a8a72124a4ba3f6b9f302587903c > 2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x13ab46479832953 Attempting to transition node > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > EGION_SPLIT to RS_ZK_REGION_SPLIT > 2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x13ab46479832953 Successfully transitioned node > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > REGION_SPLIT to RS_ZK_REGION_SPLIT > ----------------------------- > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" > continues for 15 or so hours and finally settles without manual > intervention with these regionserver log messages: > ----------------------- > 2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x13ab46479832953 Attempting to transition node * > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > RS_ZK_REGION_SPLIT > > 2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x13ab46479832953 Successfully transitioned node * > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > RS_ZK_REGION_SPLIT > > 2012-11-02 13:55:00,916 INFO > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META > updated, and report to master. > > Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. > *bc62a8a72124a4ba3f6b9f302587903c*., new regions: > > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., > > ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. > Split took 15hrs, 9mins, 33sec > > 2012-11-02 13:55:00,945 INFO > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META > updated, and report to master. > > Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., > new regions: > > ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-04, 02:27
Strangely, I don't see any record of that region in the master before what
I already pasted even though I have logs back to 10/30. Next time it happens I'll gather a full log record and try to debug while it's occurring. On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < [EMAIL PROTECTED]> wrote: > Hi Matt, > can you paste some more master logs of region > bc62a8a72124a4ba3f6b9f30258790 before split. > I think Its not problem with splitting. > We are getting > LOG.warn("Region " + encodedName + " not found on server " + > serverName + > "; failed processing"); > this log means no entry in servers map(not fully assigned). > Set<HRegionInfo> hris = this.servers.get(sn); > HRegionInfo foundHri = null; > for (HRegionInfo hri: hris) { > if (hri.getEncodedName().equals(encodedName)) { > foundHri = hri; > break; > } > } > return foundHri; > > > > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > > > CC'ing dev list... > > > > Is anybody aware of any changes that went in recently that could cause > > this? > > I looked around a bit, but could not find anything obvious. > > > > -- Lars > > > > > > > > ________________________________ > > From: Matt Corgan <[EMAIL PROTECTED]> > > To: user <[EMAIL PROTECTED]> > > Sent: Saturday, November 3, 2012 5:27 PM > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > > > I think the cluster is ok without running hbck, as restarting the > > regionserver process stops the warnings and everything looks ok > otherwise. > > > > here's the regionserver right after the split happens: > > ------------------------ > > 2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > > 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > REGION_SPLITTING to RS_ZK_REGION_SPLIT > > 2012-11-01 22:45:28,730 DEBUG > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on > the > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c > > 2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > > EGION_SPLIT to RS_ZK_REGION_SPLIT > > 2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > REGION_SPLIT to RS_ZK_REGION_SPLIT > > ----------------------------- > > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" > > continues for 15 or so hours and finally settles without manual > > intervention with these regionserver log messages: > > ----------------------- > > 2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > > regionserver:60020-0x13ab46479832953 Attempting to transition node * > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > RS_ZK_REGION_SPLIT > > > > 2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > > regionserver:60020-0x13ab46479832953 Successfully transitioned node * > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > RS_ZK_REGION_SPLIT > > > > 2012-11-02 13:55:00,916 INFO > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META > > updated, and report to master. > > > > > Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions: > > > > > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., > > > > > ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2ramkrishna vasudevan 2012-11-04, 17:12
Seems strange. I too will look into this. Backtracking the region and its
parent will give an idea. Regards Ram On Sun, Nov 4, 2012 at 7:57 AM, Matt Corgan <[EMAIL PROTECTED]> wrote: > Strangely, I don't see any record of that region in the master before what > I already pasted even though I have logs back to 10/30. Next time it > happens I'll gather a full log record and try to debug while it's > occurring. > > > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > [EMAIL PROTECTED]> wrote: > > > Hi Matt, > > can you paste some more master logs of region > > bc62a8a72124a4ba3f6b9f30258790 before split. > > I think Its not problem with splitting. > > We are getting > > LOG.warn("Region " + encodedName + " not found on server " + > > serverName + > > "; failed processing"); > > this log means no entry in servers map(not fully assigned). > > Set<HRegionInfo> hris = this.servers.get(sn); > > HRegionInfo foundHri = null; > > for (HRegionInfo hri: hris) { > > if (hri.getEncodedName().equals(encodedName)) { > > foundHri = hri; > > break; > > } > > } > > return foundHri; > > > > > > > > > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > > > CC'ing dev list... > > > > > > Is anybody aware of any changes that went in recently that could cause > > > this? > > > I looked around a bit, but could not find anything obvious. > > > > > > -- Lars > > > > > > > > > > > > ________________________________ > > > From: Matt Corgan <[EMAIL PROTECTED]> > > > To: user <[EMAIL PROTECTED]> > > > Sent: Saturday, November 3, 2012 5:27 PM > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > > > > > I think the cluster is ok without running hbck, as restarting the > > > regionserver process stops the warnings and everything looks ok > > otherwise. > > > > > > here's the regionserver right after the split happens: > > > ------------------------ > > > 2012-11-01 22:45:28,726 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > > > 2012-11-01 22:45:28,730 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT > > > 2012-11-01 22:45:28,730 DEBUG > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on > > the > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c > > > 2012-11-01 22:45:28,832 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > > > EGION_SPLIT to RS_ZK_REGION_SPLIT > > > 2012-11-01 22:45:28,837 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > REGION_SPLIT to RS_ZK_REGION_SPLIT > > > ----------------------------- > > > > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" > > > continues for 15 or so hours and finally settles without manual > > > intervention with these regionserver log messages: > > > ----------------------- > > > 2012-11-02 13:55:00,906 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node * > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > > RS_ZK_REGION_SPLIT > > > > > > 2012-11-02 13:55:00,916 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node * > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > > RS_ZK_REGION_SPLIT > > > > > > 2012-11-02 13:55:00,916 INFO
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Ted Yu 2012-11-04, 22:07
Matt:
Was there any region server crash before you noticed the repeated log for bc62a8a72124a4ba3f6b9f302587903c ? I wonder if the following JIRA might be related: HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter Thanks On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > Strangely, I don't see any record of that region in the master before what > I already pasted even though I have logs back to 10/30. Next time it > happens I'll gather a full log record and try to debug while it's > occurring. > > > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > [EMAIL PROTECTED]> wrote: > > > Hi Matt, > > can you paste some more master logs of region > > bc62a8a72124a4ba3f6b9f30258790 before split. > > I think Its not problem with splitting. > > We are getting > > LOG.warn("Region " + encodedName + " not found on server " + > > serverName + > > "; failed processing"); > > this log means no entry in servers map(not fully assigned). > > Set<HRegionInfo> hris = this.servers.get(sn); > > HRegionInfo foundHri = null; > > for (HRegionInfo hri: hris) { > > if (hri.getEncodedName().equals(encodedName)) { > > foundHri = hri; > > break; > > } > > } > > return foundHri; > > > > > > > > > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> > wrote: > > > > > CC'ing dev list... > > > > > > Is anybody aware of any changes that went in recently that could cause > > > this? > > > I looked around a bit, but could not find anything obvious. > > > > > > -- Lars > > > > > > > > > > > > ________________________________ > > > From: Matt Corgan <[EMAIL PROTECTED]> > > > To: user <[EMAIL PROTECTED]> > > > Sent: Saturday, November 3, 2012 5:27 PM > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > > > > > I think the cluster is ok without running hbck, as restarting the > > > regionserver process stops the warnings and everything looks ok > > otherwise. > > > > > > here's the regionserver right after the split happens: > > > ------------------------ > > > 2012-11-01 22:45:28,726 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > > > 2012-11-01 22:45:28,730 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT > > > 2012-11-01 22:45:28,730 DEBUG > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on > > the > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c > > > 2012-11-01 22:45:28,832 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > > > EGION_SPLIT to RS_ZK_REGION_SPLIT > > > 2012-11-01 22:45:28,837 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > REGION_SPLIT to RS_ZK_REGION_SPLIT > > > ----------------------------- > > > > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" > > > continues for 15 or so hours and finally settles without manual > > > intervention with these regionserver log messages: > > > ----------------------- > > > 2012-11-02 13:55:00,906 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node * > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > > RS_ZK_REGION_SPLIT > > > > > > 2012-11-02 13:55:00,916 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node * >
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-05, 01:16
I've had a few successful splits today without any errors. I'll turn up
the importer speed tomorrow to start stressing the cluster more. On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > Matt: > Was there any region server crash before you noticed the repeated log for > bc62a8a72124a4ba3f6b9f302587903c ? > > I wonder if the following JIRA might be related: > HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter > > Thanks > > On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > Strangely, I don't see any record of that region in the master before > what > > I already pasted even though I have logs back to 10/30. Next time it > > happens I'll gather a full log record and try to debug while it's > > occurring. > > > > > > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > > [EMAIL PROTECTED]> wrote: > > > > > Hi Matt, > > > can you paste some more master logs of region > > > bc62a8a72124a4ba3f6b9f30258790 before split. > > > I think Its not problem with splitting. > > > We are getting > > > LOG.warn("Region " + encodedName + " not found on server " + > > > serverName + > > > "; failed processing"); > > > this log means no entry in servers map(not fully assigned). > > > Set<HRegionInfo> hris = this.servers.get(sn); > > > HRegionInfo foundHri = null; > > > for (HRegionInfo hri: hris) { > > > if (hri.getEncodedName().equals(encodedName)) { > > > foundHri = hri; > > > break; > > > } > > > } > > > return foundHri; > > > > > > > > > > > > > > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> > > wrote: > > > > > > > CC'ing dev list... > > > > > > > > Is anybody aware of any changes that went in recently that could > cause > > > > this? > > > > I looked around a bit, but could not find anything obvious. > > > > > > > > -- Lars > > > > > > > > > > > > > > > > ________________________________ > > > > From: Matt Corgan <[EMAIL PROTECTED]> > > > > To: user <[EMAIL PROTECTED]> > > > > Sent: Saturday, November 3, 2012 5:27 PM > > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > > > > > > > I think the cluster is ok without running hbck, as restarting the > > > > regionserver process stops the warnings and everything looks ok > > > otherwise. > > > > > > > > here's the regionserver right after the split happens: > > > > ------------------------ > > > > 2012-11-01 22:45:28,726 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > > > > 2012-11-01 22:45:28,730 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT > > > > 2012-11-01 22:45:28,730 DEBUG > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting > on > > > the > > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c > > > > 2012-11-01 22:45:28,832 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > > > > EGION_SPLIT to RS_ZK_REGION_SPLIT > > > > 2012-11-01 22:45:28,837 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > > REGION_SPLIT to RS_ZK_REGION_SPLIT > > > > ----------------------------- > > > > > > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" > > > > continues for 15 or so hours and finally settles without manual > > > > intervention with these regionserver log messages: > > > > ----------------------- > > > > 2012-11-02 13:55:00,906 DEBUG
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-05, 01:20
Forgot to answer Ted's question - I don't remember exactly what we did to
the cluster. I don't think there were any regionserver crashes, but there were probably controlled restarts, new servers added to the running cluster, and some regions moved by our manual balancer job. On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > I've had a few successful splits today without any errors. I'll turn up > the importer speed tomorrow to start stressing the cluster more. > > > > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > >> Matt: >> Was there any region server crash before you noticed the repeated log for >> bc62a8a72124a4ba3f6b9f302587903c ? >> >> I wonder if the following JIRA might be related: >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter >> >> Thanks >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >> >> > Strangely, I don't see any record of that region in the master before >> what >> > I already pasted even though I have logs back to 10/30. Next time it >> > happens I'll gather a full log record and try to debug while it's >> > occurring. >> > >> > >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < >> > [EMAIL PROTECTED]> wrote: >> > >> > > Hi Matt, >> > > can you paste some more master logs of region >> > > bc62a8a72124a4ba3f6b9f30258790 before split. >> > > I think Its not problem with splitting. >> > > We are getting >> > > LOG.warn("Region " + encodedName + " not found on server " + >> > > serverName + >> > > "; failed processing"); >> > > this log means no entry in servers map(not fully assigned). >> > > Set<HRegionInfo> hris = this.servers.get(sn); >> > > HRegionInfo foundHri = null; >> > > for (HRegionInfo hri: hris) { >> > > if (hri.getEncodedName().equals(encodedName)) { >> > > foundHri = hri; >> > > break; >> > > } >> > > } >> > > return foundHri; >> > > >> > > >> > > >> > > >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> >> > wrote: >> > > >> > > > CC'ing dev list... >> > > > >> > > > Is anybody aware of any changes that went in recently that could >> cause >> > > > this? >> > > > I looked around a bit, but could not find anything obvious. >> > > > >> > > > -- Lars >> > > > >> > > > >> > > > >> > > > ________________________________ >> > > > From: Matt Corgan <[EMAIL PROTECTED]> >> > > > To: user <[EMAIL PROTECTED]> >> > > > Sent: Saturday, November 3, 2012 5:27 PM >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > > > >> > > > I think the cluster is ok without running hbck, as restarting the >> > > > regionserver process stops the warnings and everything looks ok >> > > otherwise. >> > > > >> > > > here's the regionserver right after the split happens: >> > > > ------------------------ >> > > > 2012-11-01 22:45:28,726 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* >> > > > 2012-11-01 22:45:28,730 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT >> > > > 2012-11-01 22:45:28,730 DEBUG >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still >> waiting on >> > > the >> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c >> > > > 2012-11-01 22:45:28,832 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT >> > > > 2012-11-01 22:45:28,837 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Jean-Daniel Cryans 2012-11-05, 17:52
This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792
What I'd like to see is the log from the first time the master receives the split message. I guess it says the region doesn't exist anymore because the split was processed already in the master but there's a failure mode similar to 4792. I saw this on another cluster last week but the logs were rolling based on size so the original split message was lost. J-D On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > I've had a few successful splits today without any errors. I'll turn up > the importer speed tomorrow to start stressing the cluster more. > > > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > >> Matt: >> Was there any region server crash before you noticed the repeated log for >> bc62a8a72124a4ba3f6b9f302587903c ? >> >> I wonder if the following JIRA might be related: >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter >> >> Thanks >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >> >> > Strangely, I don't see any record of that region in the master before >> what >> > I already pasted even though I have logs back to 10/30. Next time it >> > happens I'll gather a full log record and try to debug while it's >> > occurring. >> > >> > >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < >> > [EMAIL PROTECTED]> wrote: >> > >> > > Hi Matt, >> > > can you paste some more master logs of region >> > > bc62a8a72124a4ba3f6b9f30258790 before split. >> > > I think Its not problem with splitting. >> > > We are getting >> > > LOG.warn("Region " + encodedName + " not found on server " + >> > > serverName + >> > > "; failed processing"); >> > > this log means no entry in servers map(not fully assigned). >> > > Set<HRegionInfo> hris = this.servers.get(sn); >> > > HRegionInfo foundHri = null; >> > > for (HRegionInfo hri: hris) { >> > > if (hri.getEncodedName().equals(encodedName)) { >> > > foundHri = hri; >> > > break; >> > > } >> > > } >> > > return foundHri; >> > > >> > > >> > > >> > > >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> >> > wrote: >> > > >> > > > CC'ing dev list... >> > > > >> > > > Is anybody aware of any changes that went in recently that could >> cause >> > > > this? >> > > > I looked around a bit, but could not find anything obvious. >> > > > >> > > > -- Lars >> > > > >> > > > >> > > > >> > > > ________________________________ >> > > > From: Matt Corgan <[EMAIL PROTECTED]> >> > > > To: user <[EMAIL PROTECTED]> >> > > > Sent: Saturday, November 3, 2012 5:27 PM >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > > > >> > > > I think the cluster is ok without running hbck, as restarting the >> > > > regionserver process stops the warnings and everything looks ok >> > > otherwise. >> > > > >> > > > here's the regionserver right after the split happens: >> > > > ------------------------ >> > > > 2012-11-01 22:45:28,726 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* >> > > > 2012-11-01 22:45:28,730 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT >> > > > 2012-11-01 22:45:28,730 DEBUG >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting >> on >> > > the >> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c >> > > > 2012-11-01 22:45:28,832 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-05, 22:02
Here's one from last night with the master logs at the bottom.
http://pastebin.com/cSdMbA2a I don't see that region in the master logs for 4 days back. I'm going to import some new data from scratch soon and will be sure to keep all the master logs. On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792 > > What I'd like to see is the log from the first time the master > receives the split message. I guess it says the region doesn't exist > anymore because the split was processed already in the master but > there's a failure mode similar to 4792. > > I saw this on another cluster last week but the logs were rolling > based on size so the original split message was lost. > > J-D > > On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > I've had a few successful splits today without any errors. I'll turn up > > the importer speed tomorrow to start stressing the cluster more. > > > > > > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > > > >> Matt: > >> Was there any region server crash before you noticed the repeated log > for > >> bc62a8a72124a4ba3f6b9f302587903c ? > >> > >> I wonder if the following JIRA might be related: > >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter > >> > >> Thanks > >> > >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> > wrote: > >> > >> > Strangely, I don't see any record of that region in the master before > >> what > >> > I already pasted even though I have logs back to 10/30. Next time it > >> > happens I'll gather a full log record and try to debug while it's > >> > occurring. > >> > > >> > > >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > >> > [EMAIL PROTECTED]> wrote: > >> > > >> > > Hi Matt, > >> > > can you paste some more master logs of region > >> > > bc62a8a72124a4ba3f6b9f30258790 before split. > >> > > I think Its not problem with splitting. > >> > > We are getting > >> > > LOG.warn("Region " + encodedName + " not found on server " + > >> > > serverName + > >> > > "; failed processing"); > >> > > this log means no entry in servers map(not fully assigned). > >> > > Set<HRegionInfo> hris = this.servers.get(sn); > >> > > HRegionInfo foundHri = null; > >> > > for (HRegionInfo hri: hris) { > >> > > if (hri.getEncodedName().equals(encodedName)) { > >> > > foundHri = hri; > >> > > break; > >> > > } > >> > > } > >> > > return foundHri; > >> > > > >> > > > >> > > > >> > > > >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> > >> > wrote: > >> > > > >> > > > CC'ing dev list... > >> > > > > >> > > > Is anybody aware of any changes that went in recently that could > >> cause > >> > > > this? > >> > > > I looked around a bit, but could not find anything obvious. > >> > > > > >> > > > -- Lars > >> > > > > >> > > > > >> > > > > >> > > > ________________________________ > >> > > > From: Matt Corgan <[EMAIL PROTECTED]> > >> > > > To: user <[EMAIL PROTECTED]> > >> > > > Sent: Saturday, November 3, 2012 5:27 PM > >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > >> > > > > >> > > > I think the cluster is ok without running hbck, as restarting the > >> > > > regionserver process stops the warnings and everything looks ok > >> > > otherwise. > >> > > > > >> > > > here's the regionserver right after the split happens: > >> > > > ------------------------ > >> > > > 2012-11-01 22:45:28,726 DEBUG > >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > >> > > > 2012-11-01 22:45:28,730 DEBUG > >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Jean-Daniel Cryans 2012-11-05, 22:15
Wow, and can you figure how this happened?
2012-11-05 00:24:54,538 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. Split? Else how did the master told that RS to open a region that doesn't exist? On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > Here's one from last night with the master logs at the bottom. > http://pastebin.com/cSdMbA2a > > I don't see that region in the master logs for 4 days back. I'm going to > import some new data from scratch soon and will be sure to keep all the > master logs. > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > >> This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792 >> >> What I'd like to see is the log from the first time the master >> receives the split message. I guess it says the region doesn't exist >> anymore because the split was processed already in the master but >> there's a failure mode similar to 4792. >> >> I saw this on another cluster last week but the logs were rolling >> based on size so the original split message was lost. >> >> J-D >> >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: >> > I've had a few successful splits today without any errors. I'll turn up >> > the importer speed tomorrow to start stressing the cluster more. >> > >> > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: >> > >> >> Matt: >> >> Was there any region server crash before you noticed the repeated log >> for >> >> bc62a8a72124a4ba3f6b9f302587903c ? >> >> >> >> I wonder if the following JIRA might be related: >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter >> >> >> >> Thanks >> >> >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> >> wrote: >> >> >> >> > Strangely, I don't see any record of that region in the master before >> >> what >> >> > I already pasted even though I have logs back to 10/30. Next time it >> >> > happens I'll gather a full log record and try to debug while it's >> >> > occurring. >> >> > >> >> > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < >> >> > [EMAIL PROTECTED]> wrote: >> >> > >> >> > > Hi Matt, >> >> > > can you paste some more master logs of region >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split. >> >> > > I think Its not problem with splitting. >> >> > > We are getting >> >> > > LOG.warn("Region " + encodedName + " not found on server " + >> >> > > serverName + >> >> > > "; failed processing"); >> >> > > this log means no entry in servers map(not fully assigned). >> >> > > Set<HRegionInfo> hris = this.servers.get(sn); >> >> > > HRegionInfo foundHri = null; >> >> > > for (HRegionInfo hri: hris) { >> >> > > if (hri.getEncodedName().equals(encodedName)) { >> >> > > foundHri = hri; >> >> > > break; >> >> > > } >> >> > > } >> >> > > return foundHri; >> >> > > >> >> > > >> >> > > >> >> > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[EMAIL PROTECTED]> >> >> > wrote: >> >> > > >> >> > > > CC'ing dev list... >> >> > > > >> >> > > > Is anybody aware of any changes that went in recently that could >> >> cause >> >> > > > this? >> >> > > > I looked around a bit, but could not find anything obvious. >> >> > > > >> >> > > > -- Lars >> >> > > > >> >> > > > >> >> > > > >> >> > > > ________________________________ >> >> > > > From: Matt Corgan <[EMAIL PROTECTED]> >> >> > > > To: user <[EMAIL PROTECTED]> >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> >> > > > >> >> > > > I think the cluster is ok without running hbck, as restarting the >> >> > > > regionserver process stops the warnings and everything looks ok >> >> > > otherwise. >> >> > > > >> >> > > > here's the regionserver right after the split happens:
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-06, 01:33
Beats me - i can't find any record of it before that.
I'm importing data into another table now. I disabled/enabled the table first to make sure we have the original 4 region locations logged everywhere. Will report back... On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > Wow, and can you figure how this happened? > > 2012-11-05 00:24:54,538 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > Split? Else how did the master told that RS to open a region that doesn't > exist? > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > Here's one from last night with the master logs at the bottom. > > http://pastebin.com/cSdMbA2a > > > > I don't see that region in the master logs for 4 days back. I'm going to > > import some new data from scratch soon and will be sure to keep all the > > master logs. > > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <[EMAIL PROTECTED] > >wrote: > > > >> This reminds me a lot of > https://issues.apache.org/jira/browse/HBASE-4792 > >> > >> What I'd like to see is the log from the first time the master > >> receives the split message. I guess it says the region doesn't exist > >> anymore because the split was processed already in the master but > >> there's a failure mode similar to 4792. > >> > >> I saw this on another cluster last week but the logs were rolling > >> based on size so the original split message was lost. > >> > >> J-D > >> > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> > wrote: > >> > I've had a few successful splits today without any errors. I'll turn > up > >> > the importer speed tomorrow to start stressing the cluster more. > >> > > >> > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > >> > > >> >> Matt: > >> >> Was there any region server crash before you noticed the repeated log > >> for > >> >> bc62a8a72124a4ba3f6b9f302587903c ? > >> >> > >> >> I wonder if the following JIRA might be related: > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter > >> >> > >> >> Thanks > >> >> > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> > >> wrote: > >> >> > >> >> > Strangely, I don't see any record of that region in the master > before > >> >> what > >> >> > I already pasted even though I have logs back to 10/30. Next time > it > >> >> > happens I'll gather a full log record and try to debug while it's > >> >> > occurring. > >> >> > > >> >> > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > >> >> > [EMAIL PROTECTED]> wrote: > >> >> > > >> >> > > Hi Matt, > >> >> > > can you paste some more master logs of region > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split. > >> >> > > I think Its not problem with splitting. > >> >> > > We are getting > >> >> > > LOG.warn("Region " + encodedName + " not found on server " > + > >> >> > > serverName + > >> >> > > "; failed processing"); > >> >> > > this log means no entry in servers map(not fully assigned). > >> >> > > Set<HRegionInfo> hris = this.servers.get(sn); > >> >> > > HRegionInfo foundHri = null; > >> >> > > for (HRegionInfo hri: hris) { > >> >> > > if (hri.getEncodedName().equals(encodedName)) { > >> >> > > foundHri = hri; > >> >> > > break; > >> >> > > } > >> >> > > } > >> >> > > return foundHri; > >> >> > > > >> >> > > > >> >> > > > >> >> > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl < > [EMAIL PROTECTED]> > >> >> > wrote: > >> >> > > > >> >> > > > CC'ing dev list... > >> >> > > > > >> >> > > > Is anybody aware of any changes that went in recently that > could > >> >> cause > >> >> > > > this? > >> >> > > > I looked around a bit, but could not find anything obvious. > >> >> > > > > >> >> > > > -- Lars > >> >> > > > > >> >
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Ted Yu 2012-11-06, 03:07
I think what Matt encountered was this:
https://issues.apache.org/jira/browse/HBASE-7101 On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > Beats me - i can't find any record of it before that. > > I'm importing data into another table now. I disabled/enabled the table > first to make sure we have the original 4 region locations logged > everywhere. Will report back... > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <[EMAIL PROTECTED] > >wrote: > > > Wow, and can you figure how this happened? > > > > 2012-11-05 00:24:54,538 DEBUG > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > > > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > > > Split? Else how did the master told that RS to open a region that doesn't > > exist? > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > Here's one from last night with the master logs at the bottom. > > > http://pastebin.com/cSdMbA2a > > > > > > I don't see that region in the master logs for 4 days back. I'm going > to > > > import some new data from scratch soon and will be sure to keep all the > > > master logs. > > > > > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < > [EMAIL PROTECTED] > > >wrote: > > > > > >> This reminds me a lot of > > https://issues.apache.org/jira/browse/HBASE-4792 > > >> > > >> What I'd like to see is the log from the first time the master > > >> receives the split message. I guess it says the region doesn't exist > > >> anymore because the split was processed already in the master but > > >> there's a failure mode similar to 4792. > > >> > > >> I saw this on another cluster last week but the logs were rolling > > >> based on size so the original split message was lost. > > >> > > >> J-D > > >> > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> > > wrote: > > >> > I've had a few successful splits today without any errors. I'll > turn > > up > > >> > the importer speed tomorrow to start stressing the cluster more. > > >> > > > >> > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > > >> > > > >> >> Matt: > > >> >> Was there any region server crash before you noticed the repeated > log > > >> for > > >> >> bc62a8a72124a4ba3f6b9f302587903c ? > > >> >> > > >> >> I wonder if the following JIRA might be related: > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing > daughter > > >> >> > > >> >> Thanks > > >> >> > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED]> > > >> wrote: > > >> >> > > >> >> > Strangely, I don't see any record of that region in the master > > before > > >> >> what > > >> >> > I already pasted even though I have logs back to 10/30. Next > time > > it > > >> >> > happens I'll gather a full log record and try to debug while it's > > >> >> > occurring. > > >> >> > > > >> >> > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > > >> >> > [EMAIL PROTECTED]> wrote: > > >> >> > > > >> >> > > Hi Matt, > > >> >> > > can you paste some more master logs of region > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split. > > >> >> > > I think Its not problem with splitting. > > >> >> > > We are getting > > >> >> > > LOG.warn("Region " + encodedName + " not found on server > " > > + > > >> >> > > serverName + > > >> >> > > "; failed processing"); > > >> >> > > this log means no entry in servers map(not fully assigned). > > >> >> > > Set<HRegionInfo> hris = this.servers.get(sn); > > >> >> > > HRegionInfo foundHri = null; > > >> >> > > for (HRegionInfo hri: hris) { > > >> >> > > if (hri.getEncodedName().equals(encodedName)) { > > >> >> > > foundHri = hri; > > >> >> > > break; > > >> >> > > } > > >> >> > > } > > >> >> > > return foundHri; > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-06, 05:10
It happened in this new table that I have all the logs for. The region in
question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace of this region is on the regionserver with the "Intantiated tableName..." message 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 9784c21cbc2. I also know this region came from a recent split, but I can't find any log messages show the parent finishing the split that created this daughter region. So my guess now is that the split is actually finishing and letting clients continue to write data, but something is failing to print the log line and correctly tell the master about the new region. I've noticed that these regions are showing up on the clients in calls to HTable.getRegionLocations(), so the clients continue to function, but if I call HBaseAdmin.move() i get an UnknownRegionException. On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > I think what Matt encountered was this: > https://issues.apache.org/jira/browse/HBASE-7101 > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > Beats me - i can't find any record of it before that. > > > > I'm importing data into another table now. I disabled/enabled the table > > first to make sure we have the original 4 region locations logged > > everywhere. Will report back... > > > > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <[EMAIL PROTECTED] > > >wrote: > > > > > Wow, and can you figure how this happened? > > > > > > 2012-11-05 00:24:54,538 DEBUG > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > > > > > > > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > > > > > Split? Else how did the master told that RS to open a region that > doesn't > > > exist? > > > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> > wrote: > > > > Here's one from last night with the master logs at the bottom. > > > > http://pastebin.com/cSdMbA2a > > > > > > > > I don't see that region in the master logs for 4 days back. I'm > going > > to > > > > import some new data from scratch soon and will be sure to keep all > the > > > > master logs. > > > > > > > > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < > > [EMAIL PROTECTED] > > > >wrote: > > > > > > > >> This reminds me a lot of > > > https://issues.apache.org/jira/browse/HBASE-4792 > > > >> > > > >> What I'd like to see is the log from the first time the master > > > >> receives the split message. I guess it says the region doesn't exist > > > >> anymore because the split was processed already in the master but > > > >> there's a failure mode similar to 4792. > > > >> > > > >> I saw this on another cluster last week but the logs were rolling > > > >> based on size so the original split message was lost. > > > >> > > > >> J-D > > > >> > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> > > > wrote: > > > >> > I've had a few successful splits today without any errors. I'll > > turn > > > up > > > >> > the importer speed tomorrow to start stressing the cluster more. > > > >> > > > > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> > wrote: > > > >> > > > > >> >> Matt: > > > >> >> Was there any region server crash before you noticed the repeated > > log > > > >> for > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ? > > > >> >> > > > >> >> I wonder if the following JIRA might be related: > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing > > daughter > > > >> >> > > > >> >> Thanks > > > >> >> > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[EMAIL PROTECTED] > > > > > >> wrote: > > > >> >> > > > >> >> > Strangely, I don't see any record of that region in the master
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2lars hofhansl 2012-11-06, 05:17
So it seems you can repeat this to some extend in 0.94.2, but you have never seen this before?
-- Lars ________________________________ From: Matt Corgan <[EMAIL PROTECTED]> To: dev <[EMAIL PROTECTED]> Sent: Monday, November 5, 2012 9:10 PM Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 It happened in this new table that I have all the logs for. The region in question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace of this region is on the regionserver with the "Intantiated tableName..." message 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 9784c21cbc2. I also know this region came from a recent split, but I can't find any log messages show the parent finishing the split that created this daughter region. So my guess now is that the split is actually finishing and letting clients continue to write data, but something is failing to print the log line and correctly tell the master about the new region. I've noticed that these regions are showing up on the clients in calls to HTable.getRegionLocations(), so the clients continue to function, but if I call HBaseAdmin.move() i get an UnknownRegionException. On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > I think what Matt encountered was this: > https://issues.apache.org/jira/browse/HBASE-7101 > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > Beats me - i can't find any record of it before that. > > > > I'm importing data into another table now. I disabled/enabled the table > > first to make sure we have the original 4 region locations logged > > everywhere. Will report back... > > > > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <[EMAIL PROTECTED] > > >wrote: > > > > > Wow, and can you figure how this happened? > > > > > > 2012-11-05 00:24:54,538 DEBUG > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > > > > > > > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > > > > > Split? Else how did the master told that RS to open a region that > doesn't > > > exist? > > > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> > wrote: > > > > Here's one from last night with the master logs at the bottom. > > > > http://pastebin.com/cSdMbA2a > > > > > > > > I don't see that region in the master logs for 4 days back. I'm > going > > to > > > > import some new data from scratch soon and will be sure to keep all > the > > > > master logs. > > > > > > > > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < > > [EMAIL PROTECTED] > > > >wrote: > > > > > > > >> This reminds me a lot of > > > https://issues.apache.org/jira/browse/HBASE-4792 > > > >> > > > >> What I'd like to see is the log from the first time the master > > > >> receives the split message. I guess it says the region doesn't exist > > > >> anymore because the split was processed already in the master but > > > >> there's a failure mode similar to 4792. > > > >> > > > >> I saw this on another cluster last week but the logs were rolling > > > >> based on size so the original split message was lost. > > > >> > > > >> J-D > > > >> > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]> > > > wrote: > > > >> > I've had a few successful splits today without any errors. I'll > > turn > > > up > > > >> > the importer speed tomorrow to start stressing the cluster more. > > > >> > > > > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[EMAIL PROTECTED]> > wrote: > > > >> > > > > >> >> Matt: > > > >> >> Was there any region server crash before you noticed the repeated > > log > > > >> for > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ? > > > >> >> > > > >> >> I wonder if the following JIRA might be related:
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-06, 05:28
Yeah - we were running .94.0 since it came out but never saw it there.
I'll keep trying to narrow it down. The only harm it's causing is log spam and failing to move daughters to a new regionserver, which are definitely problems, but it's not bringing down the cluster. On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > So it seems you can repeat this to some extend in 0.94.2, but you have > never seen this before? > > > -- Lars > > > > ________________________________ > From: Matt Corgan <[EMAIL PROTECTED]> > To: dev <[EMAIL PROTECTED]> > Sent: Monday, November 5, 2012 9:10 PM > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > It happened in this new table that I have all the logs for. The region in > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace > of this region is on the regionserver with the "Intantiated tableName..." > message > > 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Instantiated > > StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 > 9784c21cbc2. > > I also know this region came from a recent split, but I can't find any log > messages show the parent finishing the split that created this daughter > region. So my guess now is that the split is actually finishing and > letting clients continue to write data, but something is failing to print > the log line and correctly tell the master about the new region. > > I've noticed that these regions are showing up on the clients in calls to > HTable.getRegionLocations(), so the clients continue to function, but if I > call HBaseAdmin.move() i get an UnknownRegionException. > > > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > > > I think what Matt encountered was this: > > https://issues.apache.org/jira/browse/HBASE-7101 > > > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > > > Beats me - i can't find any record of it before that. > > > > > > I'm importing data into another table now. I disabled/enabled the > table > > > first to make sure we have the original 4 region locations logged > > > everywhere. Will report back... > > > > > > > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans < > [EMAIL PROTECTED] > > > >wrote: > > > > > > > Wow, and can you figure how this happened? > > > > > > > > 2012-11-05 00:24:54,538 DEBUG > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > > > > > > > > > > > > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > > > > > > > Split? Else how did the master told that RS to open a region that > > doesn't > > > > exist? > > > > > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> > > wrote: > > > > > Here's one from last night with the master logs at the bottom. > > > > > http://pastebin.com/cSdMbA2a > > > > > > > > > > I don't see that region in the master logs for 4 days back. I'm > > going > > > to > > > > > import some new data from scratch soon and will be sure to keep all > > the > > > > > master logs. > > > > > > > > > > > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < > > > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > >> This reminds me a lot of > > > > https://issues.apache.org/jira/browse/HBASE-4792 > > > > >> > > > > >> What I'd like to see is the log from the first time the master > > > > >> receives the split message. I guess it says the region doesn't > exist > > > > >> anymore because the split was processed already in the master but > > > > >> there's a failure mode similar to 4792. > > > > >> > > > > >> I saw this on another cluster last week but the logs were rolling > > > > >> based on size so the original split message was lost. > > > > >> > > > > >> J-D > > > > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[EMAIL PROTECTED]>
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2lars hofhansl 2012-11-06, 05:45
Took a brief look through all SPLIT related commits since 0.94.0... Found these:
HBASE-6854 * HBASE-6713 HBASE-6329 * HBASE-6088 HBASE-5986 HBASE-6070 * The ones marked with * are (IMHO) more likely to be related. -- Lars ________________________________ From: Matt Corgan <[EMAIL PROTECTED]> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> Sent: Monday, November 5, 2012 9:28 PM Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 Yeah - we were running .94.0 since it came out but never saw it there. I'll keep trying to narrow it down. The only harm it's causing is log spam and failing to move daughters to a new regionserver, which are definitely problems, but it's not bringing down the cluster. On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > So it seems you can repeat this to some extend in 0.94.2, but you have > never seen this before? > > > -- Lars > > > > ________________________________ > From: Matt Corgan <[EMAIL PROTECTED]> > To: dev <[EMAIL PROTECTED]> > Sent: Monday, November 5, 2012 9:10 PM > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > It happened in this new table that I have all the logs for. The region in > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first trace > of this region is on the regionserver with the "Intantiated tableName..." > message > > 2012-11-05 22:24:21,162 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Instantiated > > StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 > 9784c21cbc2. > > I also know this region came from a recent split, but I can't find any log > messages show the parent finishing the split that created this daughter > region. So my guess now is that the split is actually finishing and > letting clients continue to write data, but something is failing to print > the log line and correctly tell the master about the new region. > > I've noticed that these regions are showing up on the clients in calls to > HTable.getRegionLocations(), so the clients continue to function, but if I > call HBaseAdmin.move() i get an UnknownRegionException. > > > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > > > I think what Matt encountered was this: > > https://issues.apache.org/jira/browse/HBASE-7101 > > > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > > > Beats me - i can't find any record of it before that. > > > > > > I'm importing data into another table now. I disabled/enabled the > table > > > first to make sure we have the original 4 region locations logged > > > everywhere. Will report back... > > > > > > > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans < > [EMAIL PROTECTED] > > > >wrote: > > > > > > > Wow, and can you figure how this happened? > > > > > > > > 2012-11-05 00:24:54,538 DEBUG > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > > > > > > > > > > > > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > > > > > > > Split? Else how did the master told that RS to open a region that > > doesn't > > > > exist? > > > > > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> > > wrote: > > > > > Here's one from last night with the master logs at the bottom. > > > > > http://pastebin.com/cSdMbA2a > > > > > > > > > > I don't see that region in the master logs for 4 days back. I'm > > going > > > to > > > > > import some new data from scratch soon and will be sure to keep all > > the > > > > > master logs. > > > > > > > > > > > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < > > > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > >> This reminds me a lot of > > > > https://issues.apache.org/jira/browse/HBASE-4792 > > > > >> > > > > >> What I'd like to see is the log from the first time the master
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2ramkrishna vasudevan 2012-11-06, 05:59
Could you attach the master logs at this time
2012-11-05 00:24:55? Regards Ram On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Took a brief look through all SPLIT related commits since 0.94.0... Found > these: > > HBASE-6854 * > HBASE-6713 > HBASE-6329 * > > HBASE-6088 > > HBASE-5986 > HBASE-6070 * > > > The ones marked with * are (IMHO) more likely to be related. > > -- Lars > > ________________________________ > From: Matt Corgan <[EMAIL PROTECTED]> > To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> > Sent: Monday, November 5, 2012 9:28 PM > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > Yeah - we were running .94.0 since it came out but never saw it there. > I'll keep trying to narrow it down. The only harm it's causing is log > spam and failing to move daughters to a new regionserver, which are > definitely problems, but it's not bringing down the cluster. > > > On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > > > So it seems you can repeat this to some extend in 0.94.2, but you have > > never seen this before? > > > > > > -- Lars > > > > > > > > ________________________________ > > From: Matt Corgan <[EMAIL PROTECTED]> > > To: dev <[EMAIL PROTECTED]> > > Sent: Monday, November 5, 2012 9:10 PM > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > > > It happened in this new table that I have all the logs for. The region > in > > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first > trace > > of this region is on the regionserver with the "Intantiated tableName..." > > message > > > > 2012-11-05 22:24:21,162 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: > > Instantiated > > > > > StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 > > 9784c21cbc2. > > > > I also know this region came from a recent split, but I can't find any > log > > messages show the parent finishing the split that created this daughter > > region. So my guess now is that the split is actually finishing and > > letting clients continue to write data, but something is failing to print > > the log line and correctly tell the master about the new region. > > > > I've noticed that these regions are showing up on the clients in calls to > > HTable.getRegionLocations(), so the clients continue to function, but if > I > > call HBaseAdmin.move() i get an UnknownRegionException. > > > > > > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: > > > > > I think what Matt encountered was this: > > > https://issues.apache.org/jira/browse/HBASE-7101 > > > > > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> > wrote: > > > > > > > Beats me - i can't find any record of it before that. > > > > > > > > I'm importing data into another table now. I disabled/enabled the > > table > > > > first to make sure we have the original 4 region locations logged > > > > everywhere. Will report back... > > > > > > > > > > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans < > > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > Wow, and can you figure how this happened? > > > > > > > > > > 2012-11-05 00:24:54,538 DEBUG > > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > > > > > > > > > > > > > > > > > > > > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > > > > > > > > > > Split? Else how did the master told that RS to open a region that > > > doesn't > > > > > exist? > > > > > > > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[EMAIL PROTECTED]> > > > wrote: > > > > > > Here's one from last night with the master logs at the bottom. > > > > > > http://pastebin.com/cSdMbA2a > > > > > > > > > > > > I don't see that region in the master logs for 4 days back. I'm > > > going > > > > to > > > > > > import some new data from scratch soon and will be sure to keep
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2ramkrishna vasudevan 2012-11-06, 06:30
The log shows that the first time the region was transitioned to SPLITTING
even then it was not populated in the Master's memory. On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < [EMAIL PROTECTED]> wrote: > Could you attach the master logs at this time > 2012-11-05 00:24:55? > > Regards > Ram > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED]>wrote: > >> Took a brief look through all SPLIT related commits since 0.94.0... Found >> these: >> >> HBASE-6854 * >> HBASE-6713 >> HBASE-6329 * >> >> HBASE-6088 >> >> HBASE-5986 >> HBASE-6070 * >> >> >> The ones marked with * are (IMHO) more likely to be related. >> >> -- Lars >> >> ________________________________ >> From: Matt Corgan <[EMAIL PROTECTED]> >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> >> Sent: Monday, November 5, 2012 9:28 PM >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> >> Yeah - we were running .94.0 since it came out but never saw it there. >> I'll keep trying to narrow it down. The only harm it's causing is log >> spam and failing to move daughters to a new regionserver, which are >> definitely problems, but it's not bringing down the cluster. >> >> >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> >> wrote: >> >> > So it seems you can repeat this to some extend in 0.94.2, but you have >> > never seen this before? >> > >> > >> > -- Lars >> > >> > >> > >> > ________________________________ >> > From: Matt Corgan <[EMAIL PROTECTED]> >> > To: dev <[EMAIL PROTECTED]> >> > Sent: Monday, November 5, 2012 9:10 PM >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > >> > It happened in this new table that I have all the logs for. The region >> in >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first >> trace >> > of this region is on the regionserver with the "Intantiated >> tableName..." >> > message >> > >> > 2012-11-05 22:24:21,162 DEBUG >> org.apache.hadoop.hbase.regionserver.HRegion: >> > Instantiated >> > >> > >> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 >> > 9784c21cbc2. >> > >> > I also know this region came from a recent split, but I can't find any >> log >> > messages show the parent finishing the split that created this daughter >> > region. So my guess now is that the split is actually finishing and >> > letting clients continue to write data, but something is failing to >> > the log line and correctly tell the master about the new region. >> > >> > I've noticed that these regions are showing up on the clients in calls >> to >> > HTable.getRegionLocations(), so the clients continue to function, but >> if I >> > call HBaseAdmin.move() i get an UnknownRegionException. >> > >> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote: >> > >> > > I think what Matt encountered was this: >> > > https://issues.apache.org/jira/browse/HBASE-7101 >> > > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <[EMAIL PROTECTED]> >> wrote: >> > > >> > > > Beats me - i can't find any record of it before that. >> > > > >> > > > I'm importing data into another table now. I disabled/enabled the >> > table >> > > > first to make sure we have the original 4 region locations logged >> > > > everywhere. Will report back... >> > > > >> > > > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans < >> > [EMAIL PROTECTED] >> > > > >wrote: >> > > > >> > > > > Wow, and can you figure how this happened? >> > > > > >> > > > > 2012-11-05 00:24:54,538 DEBUG >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated >> > > > > >> > > > > >> > > > >> > > >> > >> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. >> > > > > >> > > > > Split? Else how did the master told that RS to open a region that >> > > doesn't >> > > > > exist?
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-06, 08:12
Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the regionserver in question (node 169), so i'm guessing that's 22f8fa73d8af837410ca270f344f6bb8's mommy. btw - you can see my balancer kick in 45 seconds later (runs every 10 minutes) here, but so far i think that's coincidence: 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: BalanceSwitch=false I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169 and found all this stuff about a failed split: http://pastebin.com/xtXMZ388 and an attempted rollback. Looks like it errors out when it goes to put a node in ZK but it's already there. I'm not familiar with what a good split log looks like, so i'll stop commenting for now... On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < [EMAIL PROTECTED]> wrote: > The log shows that the first time the region was transitioned to SPLITTING > even then it was not populated in the Master's memory. > > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < > [EMAIL PROTECTED]> wrote: > > > Could you attach the master logs at this time > > 2012-11-05 00:24:55? > > > > Regards > > Ram > > > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED] > >wrote: > > > >> Took a brief look through all SPLIT related commits since 0.94.0... > Found > >> these: > >> > >> HBASE-6854 * > >> HBASE-6713 > >> HBASE-6329 * > >> > >> HBASE-6088 > >> > >> HBASE-5986 > >> HBASE-6070 * > >> > >> > >> The ones marked with * are (IMHO) more likely to be related. > >> > >> -- Lars > >> > >> ________________________________ > >> From: Matt Corgan <[EMAIL PROTECTED]> > >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> > >> Sent: Monday, November 5, 2012 9:28 PM > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > >> > >> Yeah - we were running .94.0 since it came out but never saw it there. > >> I'll keep trying to narrow it down. The only harm it's causing is log > >> spam and failing to move daughters to a new regionserver, which are > >> definitely problems, but it's not bringing down the cluster. > >> > >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> > >> wrote: > >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you have > >> > never seen this before? > >> > > >> > > >> > -- Lars > >> > > >> > > >> > > >> > ________________________________ > >> > From: Matt Corgan <[EMAIL PROTECTED]> > >> > To: dev <[EMAIL PROTECTED]> > >> > Sent: Monday, November 5, 2012 9:10 PM > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > >> > > >> > It happened in this new table that I have all the logs for. The > region > >> in > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the first > >> trace > >> > of this region is on the regionserver with the "Intantiated > >> tableName..." > >> > message > >> > > >> > 2012-11-05 22:24:21,162 DEBUG > >> org.apache.hadoop.hbase.regionserver.HRegion: > >> > Instantiated > >> > > >> > > >> > StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 > >> > 9784c21cbc2. > >> > > >> > I also know this region came from a recent split, but I can't find any > >> log > >> > messages show the parent finishing the split that created this > daughter > >> > region. So my guess now is that the split is actually finishing and > >> > letting clients continue to write data, but something is failing to > >> > the log line and correctly tell the master about the new region. > >> > > >> > I've noticed that these regions are showing up on the clients in calls > >> to > >> > HTable.getRegionLocations(), so the clients continue to function, but > >> if I > >> > call HBaseAdmin.move() i get an UnknownRegionException. > >> > > >> > > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2ramkrishna vasudevan 2012-11-06, 10:07
Thanks for the logs.
I found the reason. The following steps happen -> Initially the parent region P1 starts splitting. -> The split is going on normally. -> Another split starts at the same time for the same region P1. (Not sure why this started). -> Rollback happens seeing an already existing node. -> This node gets deleted in rollback and nodeDeleted Event starts. -> In nodeDeleted event the RIT for the region P1 gets deleted. -> Because of this there is no region in RIT. -> Now the first split gets over. Here the problem is we try to transit the node to SPLITTING to SPLIT. But the node even does not exist. But we don take any action on this. We think it is successful. -> Even before HBASE-6854 this could have happened. Will file a JIRA for the same. Regards Ram On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a. > Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the > regionserver in question (node 169), so i'm guessing that's > 22f8fa73d8af837410ca270f344f6bb8's mommy. > > btw - you can see my balancer kick in 45 seconds later (runs every 10 > minutes) here, but so far i think that's coincidence: > 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: > BalanceSwitch=false > > I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169 > and found all this stuff about a failed split: > http://pastebin.com/xtXMZ388 and > an attempted rollback. Looks like it errors out when it goes to put a node > in ZK but it's already there. I'm not familiar with what a good split log > looks like, so i'll stop commenting for now... > > > On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < > [EMAIL PROTECTED]> wrote: > > > The log shows that the first time the region was transitioned to > SPLITTING > > even then it was not populated in the Master's memory. > > > > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < > > [EMAIL PROTECTED]> wrote: > > > > > Could you attach the master logs at this time > > > 2012-11-05 00:24:55? > > > > > > Regards > > > Ram > > > > > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED] > > >wrote: > > > > > >> Took a brief look through all SPLIT related commits since 0.94.0... > > Found > > >> these: > > >> > > >> HBASE-6854 * > > >> HBASE-6713 > > >> HBASE-6329 * > > >> > > >> HBASE-6088 > > >> > > >> HBASE-5986 > > >> HBASE-6070 * > > >> > > >> > > >> The ones marked with * are (IMHO) more likely to be related. > > >> > > >> -- Lars > > >> > > >> ________________________________ > > >> From: Matt Corgan <[EMAIL PROTECTED]> > > >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> > > >> Sent: Monday, November 5, 2012 9:28 PM > > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > >> > > >> Yeah - we were running .94.0 since it came out but never saw it there. > > >> I'll keep trying to narrow it down. The only harm it's causing is log > > >> spam and failing to move daughters to a new regionserver, which are > > >> definitely problems, but it's not bringing down the cluster. > > >> > > >> > > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> > > >> wrote: > > >> > > >> > So it seems you can repeat this to some extend in 0.94.2, but you > have > > >> > never seen this before? > > >> > > > >> > > > >> > -- Lars > > >> > > > >> > > > >> > > > >> > ________________________________ > > >> > From: Matt Corgan <[EMAIL PROTECTED]> > > >> > To: dev <[EMAIL PROTECTED]> > > >> > Sent: Monday, November 5, 2012 9:10 PM > > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > >> > > > >> > It happened in this new table that I have all the logs for. The > > region > > >> in > > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the > first > > >> trace > > >> > of this region is on the regionserver with the "Intantiated > > >> tableName..."
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2ramkrishna vasudevan 2012-11-06, 11:29
Raised HBASE-7103 for the same.
Regards Ram On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan < [EMAIL PROTECTED]> wrote: > Thanks for the logs. > I found the reason. > > The following steps happen > -> Initially the parent region P1 starts splitting. > -> The split is going on normally. > -> Another split starts at the same time for the same region P1. (Not sure > why this started). > -> Rollback happens seeing an already existing node. > -> This node gets deleted in rollback and nodeDeleted Event starts. > -> In nodeDeleted event the RIT for the region P1 gets deleted. > -> Because of this there is no region in RIT. > -> Now the first split gets over. Here the problem is we try to transit > the node to SPLITTING to SPLIT. But the node even does not exist. > But we don take any action on this. We think it is successful. > -> Even before HBASE-6854 this could have happened. Will file a JIRA for > the same. > > Regards > Ram > > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > >> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a. >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the >> regionserver in question (node 169), so i'm guessing that's >> 22f8fa73d8af837410ca270f344f6bb8's mommy. >> >> btw - you can see my balancer kick in 45 seconds later (runs every 10 >> minutes) here, but so far i think that's coincidence: >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: >> BalanceSwitch=false >> >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169 >> and found all this stuff about a failed split: >> http://pastebin.com/xtXMZ388 and >> an attempted rollback. Looks like it errors out when it goes to put a >> node >> in ZK but it's already there. I'm not familiar with what a good split log >> looks like, so i'll stop commenting for now... >> >> >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < >> [EMAIL PROTECTED]> wrote: >> >> > The log shows that the first time the region was transitioned to >> SPLITTING >> > even then it was not populated in the Master's memory. >> > >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < >> > [EMAIL PROTECTED]> wrote: >> > >> > > Could you attach the master logs at this time >> > > 2012-11-05 00:24:55? >> > > >> > > Regards >> > > Ram >> > > >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED] >> > >wrote: >> > > >> > >> Took a brief look through all SPLIT related commits since 0.94.0... >> > Found >> > >> these: >> > >> >> > >> HBASE-6854 * >> > >> HBASE-6713 >> > >> HBASE-6329 * >> > >> >> > >> HBASE-6088 >> > >> >> > >> HBASE-5986 >> > >> HBASE-6070 * >> > >> >> > >> >> > >> The ones marked with * are (IMHO) more likely to be related. >> > >> >> > >> -- Lars >> > >> >> > >> ________________________________ >> > >> From: Matt Corgan <[EMAIL PROTECTED]> >> > >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> >> > >> Sent: Monday, November 5, 2012 9:28 PM >> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > >> >> > >> Yeah - we were running .94.0 since it came out but never saw it >> there. >> > >> I'll keep trying to narrow it down. The only harm it's causing is >> log >> > >> spam and failing to move daughters to a new regionserver, which are >> > >> definitely problems, but it's not bringing down the cluster. >> > >> >> > >> >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> >> > >> wrote: >> > >> >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you >> have >> > >> > never seen this before? >> > >> > >> > >> > >> > >> > -- Lars >> > >> > >> > >> > >> > >> > >> > >> > ________________________________ >> > >> > From: Matt Corgan <[EMAIL PROTECTED]> >> > >> > To: dev <[EMAIL PROTECTED]> >> > >> > Sent: Monday, November 5, 2012 9:10 PM >> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >>
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2lars hofhansl 2012-11-06, 15:13
Thanks Ram!
----- Original Message ----- From: ramkrishna vasudevan <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: Sent: Tuesday, November 6, 2012 3:29 AM Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 Raised HBASE-7103 for the same. Regards Ram On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan < [EMAIL PROTECTED]> wrote: > Thanks for the logs. > I found the reason. > > The following steps happen > -> Initially the parent region P1 starts splitting. > -> The split is going on normally. > -> Another split starts at the same time for the same region P1. (Not sure > why this started). > -> Rollback happens seeing an already existing node. > -> This node gets deleted in rollback and nodeDeleted Event starts. > -> In nodeDeleted event the RIT for the region P1 gets deleted. > -> Because of this there is no region in RIT. > -> Now the first split gets over. Here the problem is we try to transit > the node to SPLITTING to SPLIT. But the node even does not exist. > But we don take any action on this. We think it is successful. > -> Even before HBASE-6854 this could have happened. Will file a JIRA for > the same. > > Regards > Ram > > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > >> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a. >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the >> regionserver in question (node 169), so i'm guessing that's >> 22f8fa73d8af837410ca270f344f6bb8's mommy. >> >> btw - you can see my balancer kick in 45 seconds later (runs every 10 >> minutes) here, but so far i think that's coincidence: >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: >> BalanceSwitch=false >> >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169 >> and found all this stuff about a failed split: >> http://pastebin.com/xtXMZ388 and >> an attempted rollback. Looks like it errors out when it goes to put a >> node >> in ZK but it's already there. I'm not familiar with what a good split log >> looks like, so i'll stop commenting for now... >> >> >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < >> [EMAIL PROTECTED]> wrote: >> >> > The log shows that the first time the region was transitioned to >> SPLITTING >> > even then it was not populated in the Master's memory. >> > >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < >> > [EMAIL PROTECTED]> wrote: >> > >> > > Could you attach the master logs at this time >> > > 2012-11-05 00:24:55? >> > > >> > > Regards >> > > Ram >> > > >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED] >> > >wrote: >> > > >> > >> Took a brief look through all SPLIT related commits since 0.94.0... >> > Found >> > >> these: >> > >> >> > >> HBASE-6854 * >> > >> HBASE-6713 >> > >> HBASE-6329 * >> > >> >> > >> HBASE-6088 >> > >> >> > >> HBASE-5986 >> > >> HBASE-6070 * >> > >> >> > >> >> > >> The ones marked with * are (IMHO) more likely to be related. >> > >> >> > >> -- Lars >> > >> >> > >> ________________________________ >> > >> From: Matt Corgan <[EMAIL PROTECTED]> >> > >> To: dev <[EMAIL PROTECTED]>; lars hofhansl <[EMAIL PROTECTED]> >> > >> Sent: Monday, November 5, 2012 9:28 PM >> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > >> >> > >> Yeah - we were running .94.0 since it came out but never saw it >> there. >> > >> I'll keep trying to narrow it down. The only harm it's causing is >> log >> > >> spam and failing to move daughters to a new regionserver, which are >> > >> definitely problems, but it's not bringing down the cluster. >> > >> >> > >> >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[EMAIL PROTECTED]> >> > >> wrote: >> > >> >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you >> have >> > >> > never seen this before? >> > >> > >> > >> > >> > >> > -- Lars >> > >> > >> > >> > >> > >> > >> > >> > ________________________________
-
Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2Matt Corgan 2012-11-06, 23:36
I dug back further to find the origin of e11e8b030897d6e5b973f8fe892e0eb2
to see if it had previous problems that left ZK in a bad state. Here's the regionserver and master logs: http://pastebin.com/qcvHjNCg from about 2 hours earlier. * Nov 4, 22:34: region is created as daugher b of a split on node 159 * Nov 4, 22:35: moved to from node 159 to 169 by HBaseAdmin.move() * Nov 5, 00:24: node 169 tries to split the region but gets Failed create of ephemeral /hbase/unassigned/e11e8b030897d6e5b973f8fe892e0eb2 Is it possible that if something calls HBaseAdmin.move() on a daugher region that is 30 seconds old, it could move the region but leave that ZK node in a bad state? On Tue, Nov 6, 2012 at 7:13 AM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Thanks Ram! > > > > > > ----- Original Message ----- > From: ramkrishna vasudevan <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: > Sent: Tuesday, November 6, 2012 3:29 AM > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > Raised HBASE-7103 for the same. > > Regards > Ram > > On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan < > [EMAIL PROTECTED]> wrote: > > > Thanks for the logs. > > I found the reason. > > > > The following steps happen > > -> Initially the parent region P1 starts splitting. > > -> The split is going on normally. > > -> Another split starts at the same time for the same region P1. (Not > sure > > why this started). > > -> Rollback happens seeing an already existing node. > > -> This node gets deleted in rollback and nodeDeleted Event starts. > > -> In nodeDeleted event the RIT for the region P1 gets deleted. > > -> Because of this there is no region in RIT. > > -> Now the first split gets over. Here the problem is we try to transit > > the node to SPLITTING to SPLIT. But the node even does not exist. > > But we don take any action on this. We think it is successful. > > -> Even before HBASE-6854 this could have happened. Will file a JIRA for > > the same. > > > > Regards > > Ram > > > > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote: > > > >> Ram, here's the master log corresponding to > http://pastebin.com/cSdMbA2a. > >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the > >> regionserver in question (node 169), so i'm guessing that's > >> 22f8fa73d8af837410ca270f344f6bb8's mommy. > >> > >> btw - you can see my balancer kick in 45 seconds later (runs every 10 > >> minutes) here, but so far i think that's coincidence: > >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: > >> BalanceSwitch=false > >> > >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node > 169 > >> and found all this stuff about a failed split: > >> http://pastebin.com/xtXMZ388 and > >> an attempted rollback. Looks like it errors out when it goes to put a > >> node > >> in ZK but it's already there. I'm not familiar with what a good split > log > >> looks like, so i'll stop commenting for now... > >> > >> > >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < > >> [EMAIL PROTECTED]> wrote: > >> > >> > The log shows that the first time the region was transitioned to > >> SPLITTING > >> > even then it was not populated in the Master's memory. > >> > > >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < > >> > [EMAIL PROTECTED]> wrote: > >> > > >> > > Could you attach the master logs at this time > >> > > 2012-11-05 00:24:55? > >> > > > >> > > Regards > >> > > Ram > >> > > > >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[EMAIL PROTECTED] > >> > >wrote: > >> > > > >> > >> Took a brief look through all SPLIT related commits since 0.94.0... > >> > Found > >> > >> these: > >> > >> > >> > >> HBASE-6854 * > >> > >> HBASE-6713 > >> > >> HBASE-6329 * > >> > >> > >> > >> HBASE-6088 > >> > >> > >> > >> HBASE-5986 > >> > >> HBASE-6070 * > >> > >> > >> > >> > >> > >> The ones marked with * are (IMHO) more likely to be related. |