Most of our map jobs are IO bound. However, for the same node, the IO throughput during the map phase is only 20% of its real sequential IO capability (we tested the sequential IO throughput by iozone)
I think the reason is that while each map has a sequential IO request, since there are many maps concurrently running on the same node, this causes quite expensive IO switches.
Prefetch may be a good solution here especially a map job is supposed to scan through an entire block and no more no less. Any idea how to enable it?