HBase, mail # dev - Getting unit tests to pass


Re: Getting unit tests to pass
Stack 2013-07-23, 04:13
By way of illustration of how loaded Apache build boxes can be:

Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor
LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351
(was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was
892), ConnectionCount=0 (was 0)
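A report like the one above can be turned into numbers for comparison across runs. The following is only a rough sketch: the `name=value (was previous)` shape is taken from the pasted line, while the function names `parse_report` and `grew` are made up for illustration and are not part of the HBase test harness.

```python
import re

# The report text as pasted above, including its line wrapping.
REPORT = """Thread LEAK? -, OpenFileDescriptor=174 (was 162) - OpenFileDescriptor
LEAK? -, MaxFileDescriptor=40000 (was 40000), SystemLoadAverage=351
(was 383), ProcessCount=142 (was 144), AvailableMemoryMB=819 (was
892), ConnectionCount=0 (was 0)"""

# Matches entries of the form "Name=123 (was 456)".
ENTRY = re.compile(r"(\w+)=(\d+) \(was (\d+)\)")


def parse_report(text):
    """Return {name: (before, after)} for every entry in the report."""
    flat = " ".join(text.split())  # undo the email line wrapping
    return {name: (int(before), int(after))
            for name, after, before in ENTRY.findall(flat)}


def grew(report):
    """Names of the resources whose value increased during the test."""
    return sorted(name for name, (before, after) in report.items()
                  if after > before)


if __name__ == "__main__":
    parsed = parse_report(REPORT)
    print(parsed["SystemLoadAverage"])
    print(grew(parsed))
```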

This seems to have caused a test that usually passes to fail:
https://issues.apache.org/jira/browse/HBASE-9023

St.Ack
On Mon, Jul 22, 2013 at 11:49 AM, Stack <[EMAIL PROTECTED]> wrote:

> Below is a state of hbase 0.95/trunk unit tests (Includes a little
> taxonomy of test failure type definitions).
>
> On Andrew's ec2 build box, 0.95 is passing most of the time:
>
> http://54.241.6.143/job/HBase-0.95/
> http://54.241.6.143/job/HBase-0.95-Hadoop-2/
>
> It is not as good on Apache build box but it is getting better:
>
> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95/
> https://builds.apache.org/view/H-L/view/HBase/job/hbase-0.95-on-hadoop2/
>
> On Apache, I have seen loads up in the 500s and all file descriptors used
> according to the little resources report printed at the end of each test.
>  If these numbers are to be believed (TBD), we may never achieve 100% pass
> rate on Apache builds.
>
> Andrew's ec2 builds run the integration tests too where the apache builds
> do not -- sometimes we'll fail an integration test run which makes the
> Andrew ec2 red/green ratio look worse than it actually is.
>
> Trunk builds lag.  They are being worked on.
>
> We seem to be over the worst of the flakey unit tests.  We have a few
> stragglers still but they are being hunted down by the likes of the
> merciless Jimmy Xiang and Jeffrey Zhong.
>
> The "zombies" have been mostly nailed too (where "zombies" are tests that
> refuse to die, continuing after the suite has completed and causing the
> build to fail).  The zombie trap from test-patch.sh was ported over to the
> apache and ec2 builds and it caught the last of the undying.
>
> We are now into a new phase where "all" tests pass but the build still
> fails.  Here is an example:
> http://54.241.6.143/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
> The only clue I have to go on is the fact that when we fail, the number of
> tests run is less than the total that shows for a successful run.
>
> Unless anyone has a better idea, to figure out why we hang, I compare the
> list of tests that show in a good run vs. those of a bad run.  Tests that
> are in the good run but missing from the bad run are deemed suspect.  In
> the absence of  other evidence or other ideas, I am blaming these
> "invisibles" for the build fail.
>
> Here is an example:
>
> This is a good 0.95 hadoop2 run (notice how we are running integration
> tests tooooo and they succeed!!  On hadoop2!!!!):
>
> http://54.241.6.143/job/HBase-0.95-Hadoop-2/669/
>
> In hbase-server module:
>
> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
>
>
> This is a bad run:
>
> http://54.241.6.143/job/HBase-0.95-Hadoop-2/668/
>
> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
>
>
> If I compare tests, the successful run has:
>
> > Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
>
>
> ... where the bad run does not show the above test.
>  TestHLogSplitCompressed has 34 tests, one of which is disabled, so 33
> actually run; that would seem to account for the discrepancy
> (1491 - 1458 = 33).
>
> I've started to disable tests that fail like this, putting them aside for
> original authors or the interested to take a look to see why they fail
> occasionally.  I put them aside so we can enjoy passing builds in the
> meantime.  I've already moved aside or disabled a few tests and test
> classes:
>
> TestMultiTableInputFormat
> TestReplicationKillSlaveRS
> TestHCM.testDeleteForZKConnLeak was disabled
>
> ... and a few others.
>
> Finally (if you are still reading), I would suggest that test failures in
> hadoopqa are now more worthy of investigation.   Illustrative is what
> happened recently around "HBASE-8983 HBaseConnection#deleteAllConnections"
> where the patch had +1s and on its first run, a unit test failed (though it