Re: How to test Hadoop MapReduce under another File System NOT HDFS
Harsh J 2013-02-21, 13:23
Hi Ling,
On Thu, Feb 21, 2013 at 6:42 PM, Ling Kun <[EMAIL PROTECTED]> wrote:
> Dear all,
> I am currently look into some other filesystem implementation, like
> lustre, gluster, or other NAS with POSIX support, and trying to replace HDFS
> with it.
>
> I have implement a filesystem class( AFS) which will provide interface
> to Hadoop MapReduce, like the one of RawLocalFileSystem, and examples like
> wordcount, terasort works well.
>
> However, I am not sure whether my implementation is correct for all the
> MapReduce applications that Hadoop MapReduce+Hadoop HDFS can run.
>
> My question is :
> 1. How Hadoop community do MapReduce regression test for any update of
> Hadoop HDFS and Hadoop MapReduce

We have several unit tests under the HDFS sources (you can view them in
the source tree) which catch quite a few regressions, if not all. For
performance differences we test manually with real-life workloads, aside
from generic stress tests such as teragen/terasort. The Apache Bigtop
project also has integration-level tests for the Apache Hadoop ecosystem,
which add coverage across the stack.

> 2. Beside MapReduce wordcount and Terasort examples, are there any missing
> filesystem interface support for MapReduce application. Since the FileSystem
> has POSIX support, the hsync have also supported.

Not sure what you mean here. The major bit of HDFS that MR makes good use
of is not the FS implementation itself but the ability of HDFS to expose
the locations of its replicas to MR, so that MR can schedule its tasks
better. This is done by implementing the
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
form of APIs in HDFS. The blocks feature of HDFS increases the potential
for parallel processing, and the locality APIs help leverage that.

> 3. According to my test, the performance is worse than the HDFS+MapReduce.
> Any suggestion or hint on the performance analysis? ( Without MapReduce, the
> performance of the filesystem is better than HDFS and also local
> filesystem).
> 3.1 the following are the same for the performance comparation:
> 3.1.1 architecture: 4 node for MR, and another different 4 nodes for
> HDFS/AFS
> 3.1.2 application: the input size , the number of mapper and reducers are
> the same.

I am guessing your trouble most likely lies in exposing proper information
to MR for scheduling. HDFS gives 3 locations per block, for example, so
the MR scheduler has a small collection of hosts to choose from when
placing each task.

You can devise other tests, but what you've thought of above is a good,
simple one.

--
Harsh J
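
To illustrate the locality point above, here is a minimal, self-contained
sketch of the kind of answer a getFileBlockLocations(FileStatus, long, long)
override is expected to give: the block ranges that overlap the requested
byte range, each with the hosts that hold it. Everything in it is an
assumption for illustration only: the class LocalitySketch, the BlockRange
type (a stand-in for Hadoop's BlockLocation), the node names, and the
round-robin placement are all hypothetical, and it does not use the Hadoop
API at all. If I recall correctly, a FileSystem that does not override this
method falls back to reporting a single "localhost" location for the whole
file, which leaves the scheduler with nothing useful to work with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LocalitySketch {

    /** Hypothetical stand-in for Hadoop's BlockLocation: a byte range plus the hosts storing it. */
    static final class BlockRange {
        final long offset;
        final long length;
        final List<String> hosts;

        BlockRange(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }

        @Override
        public String toString() {
            return "offset=" + offset + " length=" + length + " hosts=" + hosts;
        }
    }

    /**
     * Returns the block ranges overlapping [start, start + len).
     * Placement is a made-up round-robin over the given nodes; a real
     * file system would ask its own metadata service where each chunk lives.
     */
    static List<BlockRange> locate(long fileLen, long blockSize, List<String> nodes,
                                   int replicas, long start, long len) {
        List<BlockRange> result = new ArrayList<>();
        if (start < 0 || len <= 0 || start >= fileLen || blockSize <= 0 || nodes.isEmpty()) {
            return result;
        }
        // Clamp the end of the requested range to the file length without overflowing.
        long end = (len > fileLen - start) ? fileLen : start + len;
        for (long b = start / blockSize; b * blockSize < end; b++) {
            long offset = b * blockSize;
            long length = Math.min(blockSize, fileLen - offset);
            int copies = Math.min(replicas, nodes.size());
            List<String> hosts = new ArrayList<>();
            for (int r = 0; r < copies; r++) {
                hosts.add(nodes.get((int) ((b + r) % nodes.size())));
            }
            result.add(new BlockRange(offset, length, hosts));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
        // A 300 MB file with 128 MB blocks and 3 replicas per block.
        for (BlockRange r : locate(300L << 20, 128L << 20, nodes, 3, 0, 300L << 20)) {
            System.out.println(r);
        }
    }
}
```

In a real FileSystem subclass each such range would be returned as a
BlockLocation(names, hosts, offset, length). The point of the sketch is
only that several distinct hosts per range give the MR scheduler choices
for placing a map task near its data.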