On 10 May 2012 17:38, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> Regarding HDFS miniclusters, the interface is already limited-private
> and there is no pressing need, but we do have test cases where we need
> to simulate DataNode failures. Also, I can conceive of an application
> unit test where I would want to set replication to 1 on some file,
> then corrupt blocks, then check that repair (at the application level)
> was successful. Would some limited public interface for that be
I'm going to weigh in as a fan of MiniDFS and MiniMR clusters.
-easiest way to spin up a basic Hadoop cluster
-lets you test failure handling as well as functionality
-lets you test code that talks to DFS clusters remotely
-lets you test topology code
-very efficient for work that goes through a couple of hundred K records.
It's the best Hadoop cluster to run on a laptop.
Today's classes are very much designed for use within the Hadoop core, and
even there only for use in test runs.
For example, they depend on system properties (test.build.data) to work
(https://issues.apache.org/jira/browse/HDFS-2209); pre-2.0 you need a
factory that patches things in at construct time:
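Roughly this shape -a minimal sketch, with the directory value and class
name being illustrative rather than anything in the Hadoop codebase; the
actual MiniDFSCluster construction is only hinted at in the comment:

```java
import java.io.File;

// Sketch of the pre-2.0 fixup: MiniDFSCluster derives its storage
// directories from a system property that only the Hadoop build itself
// sets, so patch it in before constructing the cluster.
public final class MiniClusterFixup {

  /** Ensure the test data property is set; return its value. */
  public static String ensureTestDataDir() {
    String dir = System.getProperty("test.build.data");
    if (dir == null) {
      dir = new File("target/test/data").getAbsolutePath();
      System.setProperty("test.build.data", dir);
    }
    return dir;
  }

  // With that in place you can construct the cluster, e.g.
  //   MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
  //   URI fsURI = cluster.getFileSystem().getUri();  // the getURI() convenience
}
```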
That example and the equivalent for MiniMR cluster (*) not only fix up the
properties, they also implement a getURI() method that returns the
relevant URI of the service -filesystem and JT respectively- which I've
found somewhat convenient.
In 2.0, as well as MiniMR cluster going away, something changed in the HDFS
interfaces that stopped my subclass from building -I think it was the
accessibility or location of HdfsConstants. Whatever it was, it is making
migration of my test code from 1.x to 2.x hard, which is discouraging me
from testing against it -I can't have test setups that work on both
versions.
Then there's the fact that on 1.x at least, the mini clusters are hidden in
hadoop-test-x.jar, and it hasn't always been the case that this JAR has
made it onto the Maven repositories.
Taken together, these issues show that while MiniDFSCluster and the MR
equivalents work for the core code, where accessibility, backwards
compatibility and redistribution are non-issues, the classes aren't
designed for downstream use -yet the number of people trying to use them,
myself and Andrew included, shows that we want to.
I would like to see something stable and public that could be used in this
way. Stable classes in 1.x and 2.x that let your code build on both
platforms, classes that don't need fixup before creation, and artifacts like
hadoop-minicluster.jar that you can depend on without pulling in the rest of
the tests. Oh, and a set of tests to verify cluster stability.
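On the artifact side, what I'd want is the usual test-scoped Maven
dependency -sketched here with a placeholder version property, since which
Hadoop versions actually publish such an artifact is exactly the open
question:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-minicluster</artifactId>
  <version>${hadoop.version}</version>
  <scope>test</scope>
</dependency>
```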
+1 then to making mini clusters that downstream projects can use.
Am I going to volunteer to do this? I could add it to my todo list, which
means "not for a while". If someone else has a go, I'll promise to review
it and commit it in if it's ready.