Re: How to test Hadoop MapReduce under another File System NOT HDFS
Harsh J 2013-02-21, 13:23
On Thu, Feb 21, 2013 at 6:42 PM, Ling Kun <[EMAIL PROTECTED]> wrote:
> Dear all,
> I am currently looking into some other filesystem implementations, like
> Lustre, Gluster, or other NAS with POSIX support, and trying to replace HDFS
> with one of them.
> I have implemented a filesystem class (AFS) which provides an interface
> to Hadoop MapReduce, like that of RawLocalFileSystem, and examples like
> wordcount and terasort work well.
> However, I am not sure whether my implementation is correct for all the
> MapReduce applications that Hadoop MapReduce+Hadoop HDFS can run.
> My questions are:
> 1. How does the Hadoop community do MapReduce regression testing for any
> update of Hadoop HDFS and Hadoop MapReduce?
We have several unit tests under the HDFS sources (you can view them in
the sources) which catch quite a few regressions, if not all, and for
performance differences we manually test with real-life workloads,
aside from generic stress tests such as teragen/sort. The Apache
Bigtop project also has integration-level tests for the Apache Hadoop
ecosystem, which add to the stack.
> 2. Besides the MapReduce wordcount and Terasort examples, is there any missing
> filesystem interface support for MapReduce applications? Since the filesystem
> has POSIX support, hsync is also supported.
Not sure what you mean here. The major bit of HDFS that MR makes good
use of is not the FS impl but the ability of HDFS to expose its
locations of replicas to MR for it to schedule its tasks better. This
is done by implementing the
form of APIs in HDFS. The blocks feature of HDFS increases the potential
for processing parallelism, and the locality APIs help leverage that.
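To make the locality point concrete, here is a toy sketch in Python. It is not Hadoop code and not how the real MR scheduler is implemented; the names BlockLocation, block_locations, schedule and local_fraction, the node names, and the slot counts are all made up for illustration. It only shows the idea: when the filesystem reports which hosts hold each block, a scheduler can place a task on one of those hosts; when it reports a single generic host, every task ends up reading its input remotely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class BlockLocation:
    """Hypothetical record: a byte range of a file and the hosts that hold it."""
    offset: int
    length: int
    hosts: Tuple[str, ...]


def block_locations(file_len: int, block_size: int,
                    placement: Callable[[int], Sequence[str]]) -> List[BlockLocation]:
    """Split a file into fixed-size blocks; placement(i) names the hosts of block i."""
    blocks = []
    offset, index = 0, 0
    while offset < file_len:
        length = min(block_size, file_len - offset)  # last block may be short
        blocks.append(BlockLocation(offset, length, tuple(placement(index))))
        offset += length
        index += 1
    return blocks


def schedule(blocks: List[BlockLocation],
             free_slots: Dict[str, int]) -> List[Tuple[int, str, bool]]:
    """Assign one task per block, preferring a node that holds the block.

    Returns (block offset, chosen node, is_local) for each block.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    plan = []
    for block in blocks:
        # Nodes that both hold this block and still have a free slot.
        candidates = [h for h in block.hosts if slots.get(h, 0) > 0]
        if not candidates:
            # No local choice: fall back to any node with a free slot.
            candidates = [n for n, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots left")
        node = max(candidates, key=lambda n: slots[n])  # least-loaded candidate
        slots[node] -= 1
        plan.append((block.offset, node, node in block.hosts))
    return plan


def local_fraction(plan: List[Tuple[int, str, bool]]) -> float:
    """Fraction of tasks that run on a node holding their input block."""
    return sum(1 for _, _, is_local in plan if is_local) / len(plan)
```

For example, with four nodes of two slots each and a file of eight blocks, a placement that reports three replica hosts per block lets this toy scheduler run every task locally, while a placement that always reports ("localhost",) leaves no task local. That is the kind of gap worth checking for in a custom filesystem class.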
> 3. According to my tests, the performance is worse than HDFS+MapReduce.
> Any suggestions or hints on the performance analysis? (Without MapReduce, the
> performance of the filesystem is better than HDFS and also local
> 3.1 the following are the same for the performance comparison:
> 3.1.1 architecture: 4 node for MR, and another different 4 nodes for
> 3.1.2 application: the input size and the number of mappers and reducers are
> the same.
I am guessing your trouble is most likely in exposing proper information
to MR for scheduling. HDFS gives 3 locations per block, for example,
so the MR scheduler has a small collection to choose from. You can
devise other tests but what you've thought of above is a good, simple