CombineFileInputFormat only keeps one location per block and throws poor exception when passing in directory
Jim Donofrio 2012-08-14, 03:59
1. Why does CombineFileInputFormat in trunk keep only one location per
block in both the non-splittable and splittable cases? I can understand
outputting only one location when combining many blocks into the same
split, in order to force the scheduler to choose that node. But when a file
is only one block, why unnecessarily restrict the scheduler to one node when
there are probably 2 others? This has caused some of my jobs to wait a long
time for mappers.

2. CombineFileInputFormat uses FileInputFormat.listStatus but does not
perform the check that FileInputFormat does below. This causes an
ArrayIndexOutOfBoundsException to be thrown from the OneFileInfo
constructor. CombineFileInputFormat should perform the same check to avoid
confusing users when they accidentally pass in a directory of directories.

    for (FileStatus file: files) { // check we have valid files
      if (file.isDir()) {
        throw new IOException("Not a file: "+ file.getPath());
      }
    }

    if (!isSplitable) {
      // if the file is not splitable, just create the one block with
      // full file length
      blocks = new OneBlockInfo[1];
      fileSize = stat.getLen();
      blocks[0] = new OneBlockInfo(path, 0, fileSize,
                                   locations[0].getHosts(),
                                   locations[0].getTopologyPaths());
    } else {
      ArrayList<OneBlockInfo> blocksList = new ArrayList<OneBlockInfo>(
          locations.length);
      for (int i = 0; i < locations.length; i++) {
        fileSize += locations[i].getLength();

        // each split can be a maximum of maxSize
        long left = locations[i].getLength();
        long myOffset = locations[i].getOffset();
        long myLength = 0;
        do {
          if (maxSize == 0) {
            myLength = left;
          } else {
            if (left > maxSize && left < 2 * maxSize) {
              // if remainder is between max and 2*max - then
              // instead of creating splits of size max, left-max we
              // create splits of size left/2 and left/2. This is
              // a heuristic to avoid creating really really small
              // splits.
              myLength = left / 2;
            } else {
              myLength = Math.min(maxSize, left);
            }
          }
          OneBlockInfo oneblock = new OneBlockInfo(path, myOffset,
              myLength, locations[i].getHosts(),
              locations[i].getTopologyPaths());
          left -= myLength;
          myOffset += myLength;

          blocksList.add(oneblock);
        } while (left > 0);
      }
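
To make the split-size heuristic in the quoted loop easier to follow, here is a minimal standalone sketch in plain Java with no Hadoop dependency. The class name SplitSizes and the method chunkLengths are hypothetical and exist only for illustration; the sketch reproduces just the length arithmetic of the do/while loop and assumes maxSize is never negative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: it mirrors only the length
// arithmetic of the do/while loop quoted in the post.
final class SplitSizes {

    // Cuts one block of blockLength bytes into chunk lengths of at most
    // maxSize bytes. maxSize == 0 means "no limit", as in the quoted code.
    // Assumption: maxSize >= 0.
    static List<Long> chunkLengths(long blockLength, long maxSize) {
        List<Long> lengths = new ArrayList<>();
        long left = blockLength;
        do {
            long myLength;
            if (maxSize == 0) {
                myLength = left;
            } else if (left > maxSize && left < 2 * maxSize) {
                // Remainder is between max and 2*max: halve it rather than
                // emit one full chunk followed by a very small one.
                myLength = left / 2;
            } else {
                myLength = Math.min(maxSize, left);
            }
            lengths.add(myLength);
            left -= myLength;
        } while (left > 0);
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(chunkLengths(250, 100)); // prints [100, 75, 75]
        System.out.println(chunkLengths(250, 0));   // prints [250]
    }
}
```

For a block of 250 with maxSize 100, the heuristic produces chunks of 100, 75 and 75 rather than 100, 100 and 50, which is the "avoid really small splits" behaviour described in the code comment.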