Re: Bugs in hadoop-0.23.0
Thanks Anty!

Can you please go ahead and file jira(s) for them? Of course, we'd love a patch too... :)

Arun

On Jan 17, 2012, at 10:05 PM, Anty wrote:

> Hi:
>  After reading and testing the reduce-phase code in hadoop-0.23.0, I think I
> have found some bugs; hopefully they are helpful to you.
> 1) In the constructor of OnDiskMerger in MergeManager.java, on line 472:
>
>   public OnDiskMerger(MergeManager<K, V> manager) {
>     super(manager, Integer.MAX_VALUE, exceptionReporter);
>     setName("OnDiskMerger - Thread to merge on-disk map-outputs");
>     setDaemon(true);
>   }
>
> the second constructor parameter, mergeFactor, must not be Integer.MAX_VALUE;
> it should come from io.sort.factor. With Integer.MAX_VALUE, OnDiskMerger merges
> all the files fed to it in a single pass and ignores the io.sort.factor setting
> (a sketch of the suggested change follows the quoted message).
>
> 2) Still in MergeManager.java, on line 90, the data structure
>
>   Set<Path> onDiskMapOutputs = new TreeSet<Path>();
>
> is incorrect: the on-disk files should be sorted by file length, not by the URI
> of the Path. As it stands, the files fed to OnDiskMerger are not ordered by
> size, which hurts overall merge performance (a sketch of one possible fix
> follows the quoted message).
>
> 3) Still in MergeManager.java, starting from line 745:
>
>      if (0 != onDiskBytes) {
>        final int numInMemSegments = memDiskSegments.size();
>        diskSegments.addAll(0, memDiskSegments);
>        memDiskSegments.clear();
>        // Pass mergePhase only if there is a going to be intermediate
>        // merges. See comment where mergePhaseFinished is being set
>        Progress thisPhase = (mergePhaseFinished) ? null : mergePhase;
>        RawKeyValueIterator diskMerge = Merger.merge(
>            job, fs, keyClass, valueClass, diskSegments,
>            ioSortFactor, numInMemSegments, tmpDir, comparator,
>            reporter, false, spilledRecordsCounter, null, thisPhase);
>        diskSegments.clear();
>        if (0 == finalSegments.size()) {
>          return diskMerge;
>        }
>        finalSegments.add(new Segment<K,V>(
>              new RawKVIteratorReader(diskMerge, onDiskBytes), true));
>      }
>
> The Merger.merge() call above merges the on-disk files down to io.sort.factor
> segments, possibly with intermediate merge passes. The problem is that the
> codec parameter is not passed in, so the intermediate merges do not compress
> the files they write to disk, which causes a severe performance degradation
> (a sketch follows the quoted message).
>
>
> 4) The implementation of Reader in IFile.java is (maybe) not very efficient.
> In previous Hadoop versions the reader read a whole buffer from disk at a time,
> so there was no need to hit the disk for each key/value pair. I don't know why
> this code was rewritten (a sketch of the buffered-read idea follows the quoted
> message).
>
>
> Hopefully I haven't missed anything.
>
>
> --
> Best Regards
> Anty Rao
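
A minimal sketch of the change suggested in point 1: read the merge factor from
the job configuration instead of hard-coding Integer.MAX_VALUE. The config key,
the default value, and the assumption that a jobConf field is visible inside
OnDiskMerger are illustrative only, not the actual committed patch.

    public OnDiskMerger(MergeManager<K, V> manager) {
      // Assumption: jobConf is the JobConf already held by the enclosing
      // MergeManager; "io.sort.factor" and the default of 100 are illustrative.
      super(manager,
            jobConf.getInt("io.sort.factor", 100), // was Integer.MAX_VALUE
            exceptionReporter);
      setName("OnDiskMerger - Thread to merge on-disk map-outputs");
      setDaemon(true);
    }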
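
For point 2, one possible shape of a fix (a sketch under assumptions, not the
actual change) is a TreeSet with a length-based Comparator. Comparing lengths
alone would make the TreeSet treat equal-sized files as duplicates, so the path
is used as a tie-breaker; localFS stands in for the local FileSystem that
MergeManager already holds.

    // Needs: java.util.Comparator, java.util.Set, java.util.TreeSet,
    //        java.io.IOException, org.apache.hadoop.fs.FileSystem,
    //        org.apache.hadoop.fs.Path
    Set<Path> onDiskMapOutputs = new TreeSet<Path>(new Comparator<Path>() {
      public int compare(Path a, Path b) {
        try {
          long lenA = localFS.getFileStatus(a).getLen();
          long lenB = localFS.getFileStatus(b).getLen();
          if (lenA != lenB) {
            return lenA < lenB ? -1 : 1;   // smaller files merge first
          }
        } catch (IOException e) {
          // fall back to path ordering if a length cannot be read
        }
        return a.compareTo(b);             // tie-breaker keeps distinct paths
      }
    });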
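
For point 3, the suggestion amounts to passing the reducer's compression codec
into the on-disk merge so that intermediate passes compress their output. A
hedged sketch, assuming Merger exposes an overload that accepts a
CompressionCodec right after valueClass and that codec is the field
MergeManager already uses for its other merges:

    RawKeyValueIterator diskMerge = Merger.merge(
        job, fs, keyClass, valueClass, codec, diskSegments,   // codec added
        ioSortFactor, numInMemSegments, tmpDir, comparator,
        reporter, false, spilledRecordsCounter, null, thisPhase);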
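
Point 4 is about buffering: a reader should fill a large in-memory buffer per
disk read instead of issuing a small read for every key/value pair. The class
below only illustrates that pattern and is not IFile's actual code; the
length-prefixed record format and the class name are assumptions.

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class BufferedRecordReader {
      private final DataInputStream in;

      public BufferedRecordReader(String path, int bufferSize) throws IOException {
        // One large read fills the buffer; subsequent record reads are served
        // from memory until the buffer is drained.
        this.in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(path), bufferSize));
      }

      // Reads one length-prefixed record, hitting the disk only on buffer refills.
      public byte[] readRecord() throws IOException {
        int len = in.readInt();        // record length prefix (illustrative format)
        byte[] record = new byte[len];
        in.readFully(record);
        return record;
      }

      public void close() throws IOException {
        in.close();
      }
    }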