

Ling Kun 2013-02-22, 07:40
Re: How to test Hadoop MapReduce under another File System NOT HDFS
Dear Harsh J,
   First, thank you for your quick and detailed reply. Your suggestions are
very helpful to me!

1. For the Hadoop MapReduce regression test:
1.1  In theory, as long as I have correctly implemented the whole
org.apache.hadoop.fs.FileSystem interface, Hadoop MR should work
correctly, right?

1.2 I have found regression tests for some existing implementations of
other filesystems. Apart from a few internal tests, they focus on file
operations such as create, delete, copy, isDirectory, status, etc.
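
To illustrate the kind of file-operation checks meant in 1.2, here is a
minimal sketch. It is an assumption-laden stand-in, not one of the existing
tests: it uses java.nio.file against a directory (for example a POSIX mount
point of the filesystem under test) instead of going through
org.apache.hadoop.fs.FileSystem, and the class and method names are
hypothetical. The same sequence of checks could be issued through the
FileSystem wrapper once it is in place.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical smoke test: exercises create, status, copy and delete
// under a given root directory and reports which checks failed.
final class FsSmokeTest {

    /** Runs the checks under root and returns the names of failed checks. */
    static List<String> run(Path root) {
        List<String> failures = new ArrayList<>();
        try {
            Path dir = Files.createDirectories(root.resolve("smoke"));
            Path src = dir.resolve("a.txt");
            Path dst = dir.resolve("b.txt");
            byte[] data = "hello".getBytes(StandardCharsets.UTF_8);

            Files.write(src, data);                                   // create
            if (!Files.isDirectory(dir)) failures.add("isDirectory");
            if (Files.size(src) != data.length) failures.add("status");

            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING); // copy
            if (!Arrays.equals(Files.readAllBytes(dst), data)) failures.add("copy");

            Files.delete(src);                                        // delete
            if (Files.exists(src)) failures.add("delete");

            // Clean up so the test can be repeated on the same root.
            Files.delete(dst);
            Files.delete(dir);
        } catch (IOException e) {
            failures.add("io: " + e);
        }
        return failures;
    }

    /** Convenience wrapper that runs the checks in a fresh temp directory. */
    static List<String> runOnTempDir() {
        try {
            return run(Files.createTempDirectory("fs-smoke"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a mount point as the first argument to test a real filesystem.
        List<String> failed = args.length > 0 ? run(Paths.get(args[0])) : runOnTempDir();
        System.out.println("failed checks: " + failed);
    }
}
```

An empty list means every check passed. Passing such a test only shows that
the basic operations behave; it does not show that every MapReduce
application will run.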

2 and 3:  For performance, each node of the cluster I am using has 4x1G
Ethernet with network bonding, so we can assume that the network
connection is faster than the local disk. I therefore suspect that block
location is not the key performance issue.

    Also, according to my tests, non-local block read/write performance
is about the same as local block read/write performance (the difference is
less than 5%).
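
To make the block-location discussion concrete, here is a rough model of
what the reported replica hosts allow a scheduler to do. This is only a
sketch under my own assumptions: the host names are made up, the class and
method names are hypothetical, and the real MR scheduler is more involved
(rack locality, slot availability). It simply counts the splits that could
run on a node that also holds a replica.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of node-local scheduling opportunities.
final class LocalitySketch {

    /** Builds a host set from names (illustrative helper). */
    static Set<String> hosts(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /**
     * Counts the splits that have at least one replica on a compute host,
     * i.e. the splits that could be scheduled node-local.
     */
    static int countNodeLocal(List<Set<String>> replicaHostsPerSplit,
                              Set<String> computeHosts) {
        int local = 0;
        for (Set<String> replicas : replicaHostsPerSplit) {
            if (!Collections.disjoint(replicas, computeHosts)) {
                local++;
            }
        }
        return local;
    }

    public static void main(String[] args) {
        Set<String> compute = hosts("mr1", "mr2", "mr3", "mr4");

        // Storage co-located with compute, 3 replica hosts per split.
        List<Set<String>> colocated = Arrays.asList(
                hosts("mr1", "mr2", "mr3"),
                hosts("mr2", "mr3", "mr4"),
                hosts("mr1", "mr3", "mr4"));

        // Storage on separate nodes: no replica host is a compute host.
        List<Set<String>> separate = Arrays.asList(
                hosts("st1"), hosts("st2"), hosts("st3"));

        System.out.println(countNodeLocal(colocated, compute)); // prints 3
        System.out.println(countNodeLocal(separate, compute));  // prints 0
    }
}
```

In a layout where the storage nodes are separate from the MR nodes, this
count is zero regardless of what the filesystem reports, so every read goes
over the network in that case.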

So, are there any other filesystem-related performance issues besides
GetBlockLocation?
Thanks again, Harsh
Ling Kun
On Thu, Feb 21, 2013 at 9:23 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Ling,
>
> On Thu, Feb 21, 2013 at 6:42 PM, Ling Kun <[EMAIL PROTECTED]> wrote:
> > Dear all,
> >     I am currently looking into some other filesystem implementations,
> > like Lustre, Gluster, or other NAS with POSIX support, and trying to
> > replace HDFS with them.
> >
> >     I have implemented a filesystem class (AFS) which provides an
> > interface to Hadoop MapReduce, like that of RawLocalFileSystem, and
> > examples like wordcount and terasort work well.
> >
> >    However, I am not sure whether my implementation is correct for all
> > the MapReduce applications that Hadoop MapReduce + Hadoop HDFS can run.
> >
> >    My questions are:
> > 1. How does the Hadoop community do MapReduce regression testing for
> > updates to Hadoop HDFS and Hadoop MapReduce?
>
> We have several unit tests under HDFS sources (you can view it in the
> sources) which catch quite a bit of regressions, if not all, and for
> performance differences we manually test with real-life workloads,
> aside from generic stress tests such as teragen/sort. The Apache
> Bigtop project also has integration level tests for the Apache Hadoop
> ecosystem, which adds onto the stack.
>
> > 2. Besides the MapReduce wordcount and Terasort examples, are there any
> > filesystem interfaces that MapReduce applications need which might be
> > missing? Since the filesystem has POSIX support, hsync is also
> > supported.
>
> Not sure what you mean here. The major bit of HDFS that MR makes good
> use of is not the FS impl but the ability of HDFS to expose its
> locations of replicas to MR for it to schedule its tasks better. This
> is done by implementing the
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
> form of APIs in HDFS. The blocks feature of HDFS increases processing
> parallelism potentials, and the locality APIs help leverage that.
>
> > 3. According to my test, the performance is worse than HDFS+MapReduce.
> > Any suggestion or hint on the performance analysis? (Without MapReduce,
> > the performance of the filesystem is better than HDFS and also the
> > local filesystem.)
> > 3.1 The following are the same for the performance comparison:
> > 3.1.1 Architecture: 4 nodes for MR, and a different 4 nodes for
> > HDFS/AFS.
> > 3.1.2 Application: the input size and the number of mappers and
> > reducers are the same.
>
> Am guessing most likely your trouble is in exposing proper information
> to MR for scheduling. HDFS gives 3 locations per block, for example,
> so the MR scheduler has a small collection to choose from. You can
> devise other tests but what you've thought of above is a good, simple
> one.
>
>
> --
> Harsh J
>
> --
> http://www.lingcc.com