HBase, mail # dev - Re: Build failed in Jenkins: HBase-TRUNK #2225


Re: Build failed in Jenkins: hbase-0.90 #324
Gary Helmling 2011-10-14, 18:55
So the latest 0.90 build passed and I ran TestLogRolling in a batch of
20 runs with no failures, so there must be an infrequent timing issue
in the test.
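
A batch run like this can be sketched as a small harness that repeats a
check and tallies the failures. This is only an illustration, not the
procedure actually used for the 20 runs: FlakyRunner, countFailures and
probabilityAllPass are hypothetical names, and the 2% per-run failure
rate is an assumed figure, picked to show why 20 clean runs cannot rule
out a rare timing issue.

```java
import java.util.Random;
import java.util.function.BooleanSupplier;

// Hypothetical harness; none of these names come from HBase or JUnit.
class FlakyRunner {

    /** Runs the check the given number of times and returns how many runs failed. */
    static int countFailures(BooleanSupplier check, int runs) {
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            boolean passed;
            try {
                passed = check.getAsBoolean();
            } catch (RuntimeException e) {
                passed = false; // an exception counts as a failed run
            }
            if (!passed) {
                failures++;
            }
        }
        return failures;
    }

    /** Probability that every run passes, assuming independent runs. */
    static double probabilityAllPass(double failureRate, int runs) {
        return Math.pow(1.0 - failureRate, runs);
    }

    public static void main(String[] args) {
        // Simulated flaky check: fails about once in 50 runs (assumed rate).
        Random random = new Random();
        BooleanSupplier simulated = () -> random.nextInt(50) != 0;

        System.out.println("failures in 20 runs: " + countFailures(simulated, 20));
        System.out.printf("chance of 20 clean runs at a 2%% failure rate: %.2f%n",
                probabilityAllPass(0.02, 20));
    }
}
```

Under that assumed 2% rate, roughly two out of three batches of 20 would
still come back clean, so a passing batch is weak evidence on its own.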

In the test output from the build, it looks like one of the region
servers aborted:

2011-10-14 04:40:56,092 FATAL
[RegionServer:1;vesta.apache.org,59247,1318566556886.logRoller]
regionserver.HRegionServer(1410): ABORTING region server
serverName=vesta.apache.org,59247,1318566556886, load=(requests=0,
regions=1, usedHeap=155, maxHeap=1244): Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1318567028016
at org.apache.hadoop.hbase.regionserver.wal.HLog.cleanupCurrentWriter(HLog.java:787)
at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:559)
at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
Caused by: java.io.IOException: Error Recovery for block
blk_6581012269291675208_1075 failed  because recovery from primary
datanode 127.0.0.1:48952 failed 6 times.  Pipeline was
127.0.0.1:48952. Aborting...
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)

Once the region server aborted, its WAL was split, so the file no
longer existed when the test tried to read it back later for
verification.  Some changes to the test to more strictly control when
sync is triggered might avoid this, though that will probably require
separating out testLogRollOnPipelineRestart into a separate test case
so it can control the cluster config without interfering with the
other 2 tests.  Maybe actually moving it into TestLogRollAbort would
make sense, since we're testing when we should abort vs. when we
shouldn't.
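
The abort vs. no-abort distinction can be sketched roughly as follows,
going only by the HBASE-4282 summary (abort when WAL close fails with
unflushed edits). This is not the HLog code: WalCloseSketch,
closeWriter, FailedCloseException and the unflushedEntries parameter
are hypothetical names, and treating a failed close with nothing
unflushed as tolerable is an assumption made for the illustration.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch; not the actual HLog implementation.
class WalCloseSketch {

    /** Hypothetical stand-in for a "failed log close" exception. */
    static class FailedCloseException extends IOException {
        FailedCloseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Closes the writer. Returns true on a clean close and false if the
     * close failed while no edits were at risk. Throws if the close failed
     * with unflushed edits, which the caller would treat as a reason to abort.
     */
    static boolean closeWriter(Closeable writer, int unflushedEntries)
            throws FailedCloseException {
        try {
            writer.close();
            return true;
        } catch (IOException e) {
            if (unflushedEntries > 0) {
                // Edits could be lost, so surface the failure to the caller.
                throw new FailedCloseException(
                        "close failed with " + unflushedEntries + " unflushed entries", e);
            }
            // Nothing unflushed: assumed safe to carry on with a new writer.
            return false;
        }
    }

    public static void main(String[] args) {
        Closeable broken = () -> { throw new IOException("pipeline failed"); };
        try {
            System.out.println("no unflushed edits, clean close: " + closeWriter(broken, 0));
            closeWriter(broken, 3);
        } catch (FailedCloseException e) {
            System.out.println("would abort: " + e.getMessage());
        }
    }
}
```

A test for the abort path and a test for the no-abort path would each
exercise one branch of the catch block, which is the argument for
keeping them in separate test cases with their own cluster config.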

I'll open a follow-up JIRA.

--gh
On Fri, Oct 14, 2011 at 12:39 AM, Gary Helmling <[EMAIL PROTECTED]> wrote:
>
> From Jenkins, the failure was in TestLogRolling.testLogRollOnPipelineRestart, so looks like it.  But TestLogRolling did pass for me locally prior to commit.  Re-running in a batch to see if it's intermittent...
>
>
> On Thu, Oct 13, 2011 at 11:52 PM, Stack <[EMAIL PROTECTED]> wrote:
>>
>> Do you think this fail because of your change Gary?
>>
>> Good on you,
>> St.Ack
>>
>> On Fri, Oct 14, 2011 at 6:06 AM, Apache Jenkins Server
>> <[EMAIL PROTECTED]> wrote:
>> > See <https://builds.apache.org/job/hbase-0.90/324/changes>
>> >
>> > Changes:
>> >
>> > [Gary Helmling] HBASE-4282  RegionServer should abort when WAL close fails with unflushed edits
>> >
>> > ------------------------------------------
>> > [...truncated 1218 lines...]
>> > Running org.apache.hadoop.hbase.mapred.TestTableInputFormat
>> > Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 20.817 sec
>> > Running org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
>> > Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 129.784 sec
>> > Running org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable
>> > Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 25.575 sec
>> > Running org.apache.hadoop.hbase.regionserver.TestFSErrorsExposed
>> > Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 29.235 sec
>> > Running org.apache.hadoop.hbase.client.replication.TestReplicationAdmin
>> > Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.606 sec
>> > Running org.apache.hadoop.hbase.regionserver.TestScanDeleteTracker
>> > Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.147 sec
>> > Running org.apache.hadoop.hbase.client.TestMetaScanner
>> > Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.631 sec
>> > Running org.apache.hadoop.hbase.metrics.TestMetricsMBeanBase
>> > Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec
>> > Running org.apache.hadoop.hbase.TestRegionRebalancing
>> > Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 80.979 sec