I'm wondering: if I'm running MapReduce jobs on a cluster with large block
sizes, can I increase performance with either:
1) A custom FileInputFormat
2) A custom partitioner
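To make (1) concrete, here is a minimal, hypothetical sketch of the split-size arithmetic that Hadoop's FileInputFormat uses internally (splitSize = max(minSize, min(maxSize, blockSize))). It is plain Java with made-up file/block numbers, not a real InputFormat subclass, but it shows the knob a custom FileInputFormat would turn: capping maxSize below the block size yields finer-grained splits, at the cost of splits no longer aligning one-to-one with blocks.

```java
// Sketch of Hadoop FileInputFormat's split-size logic (plain Java, no Hadoop
// dependency). Numbers below are illustrative, not from a real cluster.
public class SplitSketch {

    // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file of the given length (ignoring the
    // SPLIT_SLOP fudge factor Hadoop applies to the last split).
    static int countSplits(long fileLength, long splitSize) {
        return (int) ((fileLength + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long blockSize = 256L * 1024 * 1024;  // a "large" 256 MB block
        long fileLen   = 1024L * 1024 * 1024; // a 1 GB input file

        // Default behaviour: one split per block -> 4 map tasks.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(countSplits(fileLen, defaultSplit)); // prints 4

        // A custom input format could cap maxSize at 64 MB -> 16 map tasks,
        // trading block alignment for more parallelism.
        long cappedSplit = computeSplitSize(blockSize, 1L, 64L * 1024 * 1024);
        System.out.println(countSplits(fileLen, cappedSplit)); // prints 16
    }
}
```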
Clearly, (3) will be an issue, since it might overload tasks and increase
network traffic... but maybe (1) or (2) would be a precise way to "use"
partitions as a "poor man's" block.
Just a thought; I'm not sure if anyone has tried (1) or (2) before to
simulate blocks and increase locality by utilizing the partition API.
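For (2), the contract a custom Partitioner implements is just getPartition(key, value, numPartitions) -> int. Below is a self-contained, hypothetical sketch (the "blockA/rec1" key scheme and the prefixPartition helper are my inventions, not Hadoop API): the default hash-style routing scatters keys, while a prefix-based routing keeps records that share a prefix in the same partition, which is roughly the "partition as a poor man's block" idea. One caveat worth noting: the partitioner only shapes reduce-side grouping during the shuffle, so it can co-locate related data in output partitions but cannot by itself improve map-side read locality.

```java
// Hypothetical sketch of custom-partitioner routing (plain Java, no Hadoop
// dependency). The key format "prefix/record" is invented for illustration.
public class PartitionSketch {

    // Same logic as Hadoop's default HashPartitioner.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Custom routing: partition on a key prefix so that all records sharing
    // a prefix land in the same partition (i.e. the same reduce task/output).
    static int prefixPartition(String key, int numPartitions) {
        int slash = key.indexOf('/');
        String prefix = (slash >= 0) ? key.substring(0, slash) : key;
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int n = 8;
        // Records from the same "block" are guaranteed to co-locate.
        System.out.println(
            prefixPartition("blockA/rec1", n) == prefixPartition("blockA/rec2", n));
        // With plain hashing of the full key, no such guarantee exists.
        System.out.println(hashPartition("blockA/rec1", n));
    }
}
```

In a real job this logic would live in a subclass of org.apache.hadoop.mapreduce.Partitioner, wired up via Job.setPartitionerClass().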