Look at these beautiful columns of trunk and 0.95 blue and green dots:
+ Andrew's ec2 build: http://126.96.36.199/ (click on the 0.95 and trunk
+ Apache: https://builds.apache.org/view/H-L/view/HBase/ (ditto)
Thanks to all who helped get the tests passing and over the hump (Jimmy, Matteo,
Jeffrey, JD, etc.).
There are still a few flakies and it looks like DistributedLogSplitting can
go 'invisible' on occasion but their time is nigh!
From here on out, let's keep the dots blue or green. A failed test, though
it may seem unrelated, probably is related somehow, so I'd suggest paying
closer attention to fails from here on out (sign up for the builds mailing
list if you have not already).
I'd also suggest that we tend away from big fat integration-type unit
tests. Apache infrastructure is overloaded and it is a PITA setting timeouts
and retries so tests will pass in this "hostile" setting. Consider making
an hbase-it contrib. instead. These are run w/ less regularity but are
approaching quotidian hopefully on a test rig near you.
On Mon, Jul 22, 2013 at 11:49 AM, Stack <[EMAIL PROTECTED]> wrote:
> Below is a state of hbase 0.95/trunk unit tests (Includes a little
> taxonomy of test failure type definitions).
> On Andrew's ec2 build box, 0.95 is passing most of the time:
> It is not as good on Apache build box but it is getting better:
> On Apache, I have seen loads up in the 500s and all file descriptors used
> according to the little resources report printed at the end of each test.
> If these numbers are to be believed (TBD), we may never achieve 100% pass
> rate on Apache builds.
> Andrew's ec2 builds run the integration tests too where the apache builds
> do not -- sometimes we'll fail an integration test run which makes the
> Andrew ec2 red/green ratio look worse than it actually is.
> Trunk builds lag. They are being worked on.
> We seem to be over the worst of the flakey unit tests. We have a few
> stragglers still but they are being hunted down by the likes of the
> merciless Jimmy Xiang and Jeffrey Zhong.
> The "zombies" have been mostly nailed too (where "zombies" are tests that
> refuse to die, continuing after the suite has completed and causing the build to
> fail). The zombie trap from test-patch.sh was ported over to the apache and
> ec2 builds and it caught the last of the undying.
> We are now into a new phase where "all" tests pass but the build still
> fails. Here is an example:
> http://188.8.131.52/job/HBase-TRUNK/429/org.apache.hbase$hbase-server/
> The only clue I have to go on is the fact that when we fail, the number of
> tests run is less than the total that shows for a successful run.
> Unless anyone has a better idea, to figure out why we hang, I compare the
> list of tests that show in a good run vs. those of a bad run. Tests that
> are in the good run but missing from the bad run are deemed suspect. In
> the absence of other evidence or other ideas, I am blaming these
> "invisibles" for the build fail.
> Here is an example:
> This is a good 0.95 hadoop2 run (notice how we are running integration
> tests tooooo and they succeed!! On hadoop2!!!!):
> In hbase-server module:
> Tests run: 1491, Failures: 0, Errors: 0, Skipped: 19
> This is a bad run:
> Tests run: 1458, Failures: 0, Errors: 0, Skipped: 18
> If I compare tests, the successful run has:
> > Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed
> ... where the bad run does not show the above test.
> TestHLogSplitCompressed has 34 tests one of which is disabled so that
> would seem to account for the discrepancy.
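The good-run vs. bad-run comparison described above can be scripted. What follows is only a rough sketch, not the tooling actually used on these builds: it assumes the console text of each build has been saved somewhere, and that each test class is announced by a line of the form "Running <class name>" as in the example above. The function names and the sample class names in the demo (other than TestHLogSplitCompressed) are made up for illustration.

```python
import re

# Assumption: each test class shows up in the console output as a line
# of the form "Running <fully.qualified.ClassName>".
RUNNING = re.compile(r"^Running\s+([\w.$]+)\s*$")


def classes_run(console_text):
    """Return the set of class names found on 'Running ...' lines."""
    names = set()
    for line in console_text.splitlines():
        match = RUNNING.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


def invisibles(good_console, bad_console):
    """Classes present in the good run but missing from the bad run."""
    return sorted(classes_run(good_console) - classes_run(bad_console))


if __name__ == "__main__":
    # Hypothetical console snippets; only TestHLogSplitCompressed comes
    # from the thread, the example.* names are placeholders.
    good = "\n".join([
        "Running example.TestAlpha",
        "Running org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed",
        "Running example.TestBeta",
    ])
    bad = "\n".join([
        "Running example.TestAlpha",
        "Running example.TestBeta",
    ])
    for name in invisibles(good, bad):
        print("suspect:", name)
```

Anything it prints is only a suspect: a class missing from the bad run may be the one that hung, or may just never have been scheduled because something else hung first.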
> I've started to disable tests that fail like this, putting them aside for