MapReduce >> mail # user >> CombineFileInputFormat only keeps one location per block and throws poor exception when passing in directory


CombineFileInputFormat only keeps one location per block and throws poor exception when passing in directory
1. Why does CombineFileInputFormat in trunk keep only one location per
block, in both the non-splitable and splitable cases? I can understand
emitting only one location when combining many blocks into the same split,
in order to force the scheduler to choose that node. But when a file is a
single block, why unnecessarily restrict the scheduler to one node when
there are probably two other replicas? This has caused some of my jobs to
wait a long time for mappers.
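To make the concern concrete, here is a minimal self-contained sketch (plain Java, not the Hadoop API; the class and method names are invented for illustration) contrasting a split that records only one candidate host with one that keeps every replica's host:

```java
// Hypothetical illustration, not Hadoop code: a single-block file with
// replication factor 3 has three hosts that could run the mapper locally.
public class SplitLocationDemo {

    // Mimics keeping only one location: the scheduler sees one candidate.
    static String[] restrictToFirst(String[] replicaHosts) {
        return new String[] { replicaHosts[0] };
    }

    // Keeps every replica's host: the scheduler can pick any of them.
    static String[] keepAll(String[] replicaHosts) {
        return replicaHosts.clone();
    }

    public static void main(String[] args) {
        String[] replicas = { "node1", "node2", "node3" };
        System.out.println(restrictToFirst(replicas).length); // 1 candidate node
        System.out.println(keepAll(replicas).length);         // 3 candidate nodes
    }
}
```

With only one candidate node recorded, a busy node1 stalls the mapper even though node2 and node3 hold the same block.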

2. CombineFileInputFormat uses FileInputFormat.listStatus but does not
perform the same validity check that FileInputFormat does (shown below).
As a result, an ArrayIndexOutOfBoundsException is thrown from the
OneFileInfo constructor. CombineFileInputFormat should perform the same
check to avoid confusing users who accidentally pass in a directory of
directories.
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDir()) {
        throw new IOException("Not a file: " + file.getPath());
      }
    }
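A minimal sketch of what applying that guard would look like, using simplified stand-in types (the nested FileStatus class and checkFiles method below are invented for illustration, not the real Hadoop classes):

```java
// Hedged sketch, not the actual Hadoop code: applying the same directory
// guard before per-file block processing, so a directory of directories
// fails with a clear IOException rather than an
// ArrayIndexOutOfBoundsException later on.
import java.io.IOException;
import java.util.List;

public class DirectoryGuardDemo {
    /** Minimal stand-in for org.apache.hadoop.fs.FileStatus. */
    static class FileStatus {
        final String path;
        final boolean dir;
        FileStatus(String path, boolean dir) { this.path = path; this.dir = dir; }
        boolean isDir() { return dir; }
        String getPath() { return path; }
    }

    // Same shape as the FileInputFormat check quoted above.
    static void checkFiles(List<FileStatus> files) throws IOException {
        for (FileStatus file : files) {
            if (file.isDir()) {
                throw new IOException("Not a file: " + file.getPath());
            }
        }
    }

    public static void main(String[] args) {
        try {
            checkFiles(List.of(new FileStatus("/in/part-0", false),
                               new FileStatus("/in/subdir", true)));
        } catch (IOException e) {
            System.out.println(e.getMessage()); // Not a file: /in/subdir
        }
    }
}
```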
The corresponding block-construction code from the OneFileInfo constructor
in trunk:

        if (!isSplitable) {
          // if the file is not splitable, just create the one block with
          // full file length
          blocks = new OneBlockInfo[1];
          fileSize = stat.getLen();
          blocks[0] = new OneBlockInfo(path, 0, fileSize, locations[0]
              .getHosts(), locations[0].getTopologyPaths());
        } else {
          ArrayList<OneBlockInfo> blocksList = new ArrayList<OneBlockInfo>(
              locations.length);
          for (int i = 0; i < locations.length; i++) {
            fileSize += locations[i].getLength();

            // each split can be a maximum of maxSize
            long left = locations[i].getLength();
            long myOffset = locations[i].getOffset();
            long myLength = 0;
            do {
              if (maxSize == 0) {
                myLength = left;
              } else {
                if (left > maxSize && left < 2 * maxSize) {
                  // if remainder is between max and 2*max - then
                  // instead of creating splits of size max, left-max we
                  // create splits of size left/2 and left/2. This is
                  // a heuristic to avoid creating really really small
                  // splits.
                  myLength = left / 2;
                } else {
                  myLength = Math.min(maxSize, left);
                }
              }
              OneBlockInfo oneblock = new OneBlockInfo(path, myOffset,
                  myLength, locations[i].getHosts(), locations[i]
                      .getTopologyPaths());
              left -= myLength;
              myOffset += myLength;

              blocksList.add(oneblock);
            } while (left > 0);
          }
        }
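A worked example of the left/2 heuristic quoted above, as standalone Java that reimplements just the length computation (the class and method names are mine; sizes are plain longs, with MB-scale numbers used purely for illustration):

```java
// Sketch of the split-sizing heuristic: when the remaining bytes fall
// strictly between maxSize and 2*maxSize, the remainder is cut into two
// halves instead of a maxSize piece plus a small trailing split.
import java.util.ArrayList;
import java.util.List;

public class SplitHeuristicDemo {
    static List<Long> splitLengths(long blockLen, long maxSize) {
        List<Long> lengths = new ArrayList<>();
        long left = blockLen;
        do {
            long myLength;
            if (maxSize == 0) {
                myLength = left;            // no cap: one split takes it all
            } else if (left > maxSize && left < 2 * maxSize) {
                myLength = left / 2;        // halve to avoid a tiny remainder
            } else {
                myLength = Math.min(maxSize, left);
            }
            lengths.add(myLength);
            left -= myLength;
        } while (left > 0);
        return lengths;
    }

    public static void main(String[] args) {
        // 200 "MB" with a 128 cap: two splits of 100, not 128 + 72.
        System.out.println(splitLengths(200L, 128L)); // [100, 100]
        // 300 with a 128 cap: 128 first, then 172 is between 128 and 256,
        // so the remainder is halved into 86 + 86.
        System.out.println(splitLengths(300L, 128L)); // [128, 86, 86]
    }
}
```

The heuristic trades a slightly undersized pair of splits for never emitting a split far smaller than maxSize.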