I'm seeing an odd storage performance problem that I hope can be fixed with the right configuration parameter, but nothing I've tried so far helps. These tests were done in a virtual machine running on ESX, but earlier tests on native RHEL showed something similar.
7 nodes with 10 GbE interconnect.
Each node: 2-socket Westmere, 96 GB RAM, 10 local SATA disks exported to the VM as JBODs, single 92 GB VM.
TestDFSIO: 140 files, 7143 MB each (about 1 TB total data), so 2 map tasks per disk. Replication=2.
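As a quick sanity check on the sizing above, here is a small Python sketch. The variable names are mine, and "MB" is simply treated the same way for every figure.

```python
# Workload sizing, using only the figures quoted in this post.
nodes = 7
disks_per_node = 10
files = 140            # one TestDFSIO map task per file
file_size_mb = 7143
replication = 2

total_disks = nodes * disks_per_node
map_tasks_per_disk = files // total_disks
logical_mb = files * file_size_mb          # data written by the map tasks
physical_mb = logical_mb * replication     # data that actually lands on disk

print("disks: %d, map tasks per disk: %d" % (total_disks, map_tasks_per_disk))
print("logical data: %d MB, on-disk data with replication: %d MB"
      % (logical_mb, physical_mb))
```

So the disks have to absorb about 2 TB in total, not 1 TB, once replication is counted.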
Case A: RHEL 5.5, EXT3 file system, write through configured on the physical disk
Case B: RHEL 6.1, EXT4 FS, write back
Testing with aio-stress shows that the changes made in Case B all improved efficiency and performance. But the write test of TestDFSIO on Hadoop (CDH3u0) got slower:
Case A: 580 seconds exec time
Case B: 740 seconds
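For reference, these times can be turned into a rough average write rate per node. This is only a sketch under stated assumptions: the 140 x 7143 MB data set with replication=2, replica writes spread evenly over the 7 nodes, and the whole exec time spent writing (job startup and ramp-down mean the real steady-state rate is somewhat higher). The function name is mine.

```python
def avg_node_write_mbps(files, file_size_mb, replication, nodes, exec_seconds):
    """Average on-disk write rate per node, in MB/s, over the whole run."""
    physical_mb = files * file_size_mb * replication
    return physical_mb / float(exec_seconds) / nodes

case_a = avg_node_write_mbps(140, 7143, 2, 7, 580)
case_b = avg_node_write_mbps(140, 7143, 2, 7, 740)

print("Case A: %.0f MB/s per node" % case_a)
print("Case B: %.0f MB/s per node" % case_b)
```

By this estimate Case A averages a bit under 500 MB/s per node and Case B a bit under 400 MB/s.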
I can improve Case B to 710 seconds either by going back to EXT3 or by mounting EXT4 with min_batch_time=2000. In other words, slowing down the file system improves Hadoop performance.
Both cases show a peak write throughput of about 550 MB/s on each node. The difference is that in Case A the throughput is steady and doesn't drop below 500 MB/s, while in Case B it is very noisy, sometimes going all the way to 0. It is also sometimes periodic, rising and falling with a 15-30 second period, and that period is synchronized across all the nodes.
550 MB/s appears to be a controller limit: each disk alone is capable of 130 MB/s with a raw partition or EXT4, and about 100 MB/s with EXT3. I tried replication=1 to eliminate nearly all networking, but storage throughput was still not steady.
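In case anyone wants to reproduce the throughput traces, one way to capture them is to sample the "sectors written" counter in /proc/diskstats once per second on every node. The following is a minimal sketch, not the tool I used for the numbers above. The helper names and the device names in the usage comment are hypothetical.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts in 512-byte sectors

def sectors_written(diskstats_text, devices):
    """Parse /proc/diskstats text and return {device: sectors written}."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads rd_merged rd_sectors rd_ms
        #         writes wr_merged wr_sectors ...
        if len(fields) >= 10 and fields[2] in devices:
            result[fields[2]] = int(fields[9])
    return result

def write_mbps(before, after, interval_s):
    """Aggregate write rate in MB/s between two samples."""
    delta = sum(after[d] - before[d] for d in after if d in before)
    return delta * SECTOR_BYTES / 1e6 / interval_s

def sample(devices, interval_s=1.0, count=60, path="/proc/diskstats"):
    """Print one timestamped aggregate write rate per interval."""
    with open(path) as f:
        prev = sectors_written(f.read(), devices)
    for _ in range(count):
        time.sleep(interval_s)
        with open(path) as f:
            cur = sectors_written(f.read(), devices)
        print("%d %.1f MB/s" % (time.time(), write_mbps(prev, cur, interval_s)))
        prev = cur

# Usage (device names are placeholders for the 10 data disks):
#   sample({"sdb", "sdc", "sdd"}, interval_s=1.0, count=600)
```

With timestamps from every node, the traces can be lined up to confirm whether the 15-30 second period really is in phase across the cluster.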
I'm thinking that faster storage somehow confuses the scheduler, but I don't see what the mechanism is. Any ideas about what's going on, or things to try? I don't want to have to recommend de-tuning storage in order to get Hadoop to behave.
Thanks for the help,