This might be a very easy question, but I was wondering how the Accumulo
Input Format handles a tablet file that is split across multiple nodes.
For example, suppose I have a tablet file that is 1GB in size and my Hadoop
block size is 256MB. Then up to 4 nodes could be holding the data from my
tablet file. However, when Accumulo Input Format creates mappers, it
creates one mapper for every tablet. This might mean that up to 3 blocks
are transferred over the network to the node where the mapper is running,
so that the mapper has all of the tablet's data.
Am I correct in this assumption? Or is the TabletServer doing something
else underneath to make sure all the data actually resides on one server,
so there is no network overhead of moving blocks before a MapReduce job?
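
To make the arithmetic in my example concrete, here is a minimal sketch of
the worst case I have in mind. It is not Accumulo or Hadoop code: the
function names are hypothetical, and it assumes every block lands on a
different node, that exactly one block is local to the mapper, and that
replication does not put extra copies on the mapper's node.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to store a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def max_remote_blocks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Worst-case blocks a single mapper would read over the network.

    Assumption: one block is local to the mapper and every other block
    lives on a different node with no local replica.
    """
    return max(block_count(file_size_bytes, block_size_bytes) - 1, 0)


if __name__ == "__main__":
    blocks = block_count(1 * GB, 256 * MB)
    remote = max_remote_blocks(1 * GB, 256 * MB)
    print(f"blocks: {blocks}, worst-case remote blocks: {remote}")
```

With the numbers above (1GB file, 256MB blocks) this gives 4 blocks and, in
the worst case, 3 of them read remotely, which is the scenario I am asking
about.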