Overall, I think the main consideration should be how much load you
expect to support on your cluster. For HDFS, there is a good amount
of information about how much RAM is required to support a certain
amount of data stored in DFS; something similar can be found for
Map/Reduce as well. There are also a few configuration options that
let the JobTracker use less memory. Depending on your load, the
answer may really have to be "add more RAM" rather than any tweak
to the JVM heap sizes or other configuration. Please do consider
that first.
Anyway, some answers to your questions inline:
> Machines in my cluster have relatively small physical memory (4GB)
How much swap do you have? While swap is available for use as well,
relying on it is not advisable: in our experience, once the JVM
starts to thrash to disk, performance degrades rapidly.
> I was wondering if I could reduce the heap size that namenode and jobtracker
> are assigned.
> The default heap size is 1000MB respectively, and I know that.
> The thing is, does that 1000MB mean maximum possible memory that namenode(or
> jobtracker) can use?
> What I mean is that does namenode start with minimum memory and increase the
> memory size all the way up to 1000MB depending on the job status?
> Or is namenode given 1000MB from the beginning so that there is no
> flexibility at all?
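The 1000MB is the maximum heap size (-Xmx). The JVM starts with a
smaller heap and grows it towards that limit as needed. If you want,
you can control the starting size with another JVM parameter, -Xms,
which tells the VM to start with the specified heap size and grow
from there.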
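
To make the -Xms/-Xmx pairing concrete, here is a small sketch that
builds the flag string for a daemon. The helper name heap_opts and the
256MB/512MB values are made up for illustration. HADOOP_NAMENODE_OPTS
is the usual per-daemon variable in hadoop-env.sh, but whether flags
set there take precedence over the default heap setting depends on the
launcher script in your version, so please check that before relying
on it.

```python
def heap_opts(initial_mb, max_mb):
    """Return JVM flags for an initial (-Xms) and maximum (-Xmx) heap in MB."""
    if initial_mb <= 0 or max_mb <= 0:
        raise ValueError("heap sizes must be positive")
    if initial_mb > max_mb:
        # The JVM refuses to start when the initial heap exceeds the maximum.
        raise ValueError("-Xms cannot exceed -Xmx")
    return "-Xms%dm -Xmx%dm" % (initial_mb, max_mb)


# Hypothetical values: start the namenode at 256MB, let it grow to 512MB.
namenode_opts = heap_opts(256, 512)
print('export HADOOP_NAMENODE_OPTS="%s"' % namenode_opts)
```

Setting -Xms equal to -Xmx would give the fixed, no-flexibility
behaviour you were worried about; leaving -Xms low keeps the daemon
small until it actually needs the memory.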
> If namenode and jobtracker do start with solid 1000MB then I would have to
> dial them down to several hundreds of mega byte since I only 4GB of memory.
> 2giga bytes of memory taken up just by namenode and jobtracker is too much
> an expense for me.
> My question also applies to heap size of child JVM. I know that they are
> originally given 200MB of heap size.
> I intend to increase the heap size to 512MB, but if the heap size allocation
> has no flexibility then I'd have to maintain the 200MB configuration.
> Take out the 2GB (used by namenode and jobtracker) from the total 4GB, I can
> have only 4 map/reduce tasks with 512MB configuration and since I have quad
> core CPU this would be a waste.
Please also take into account datanodes/tasktrackers and the OS itself.
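
As a rough sketch of that budget, the arithmetic could look like the
following. Everything here is an assumption for illustration: that one
4GB machine hosts all four daemons, that each daemon's maximum heap is
the 1000MB default unless tuned, and that 512MB is set aside for the
OS. It also counts heap only, while a real JVM uses more than its
heap, so the result is optimistic.

```python
def task_slots(total_mb, daemon_heaps_mb, os_reserve_mb, child_heap_mb):
    """Return how many child JVMs fit after the daemons and the OS are budgeted."""
    remaining = total_mb - sum(daemon_heaps_mb.values()) - os_reserve_mb
    return max(0, remaining // child_heap_mb)


TOTAL_MB = 4096      # physical memory from the question
OS_RESERVE_MB = 512  # assumed reserve for the OS, not a measured value

# Worst case: every daemon grows to an assumed 1000MB maximum heap.
defaults = {"namenode": 1000, "jobtracker": 1000,
            "datanode": 1000, "tasktracker": 1000}

# Hypothetical tuned-down heaps; the right values depend on your load.
tuned = {"namenode": 512, "jobtracker": 512,
         "datanode": 256, "tasktracker": 256}

print(task_slots(TOTAL_MB, defaults, OS_RESERVE_MB, 512))
print(task_slots(TOTAL_MB, tuned, OS_RESERVE_MB, 512))
print(task_slots(TOTAL_MB, tuned, OS_RESERVE_MB, 200))
```

With the defaults nothing is left over for tasks in the worst case,
which is why the daemons and the OS need to be in the picture before
picking a child heap size.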
> Oh, and one last thing.
> I am using Hadoop streaming.
> I read from a book that when you are using hadoop streaming, you should
> allocate less heap size to child JVM. (I am not sure if it meant less than
> 200MB or less than 400MB)
> Because streaming does not allow enough memory for user's processes to run.
> So what is the optimal heap size for map/reduce tasks in hadoop streaming?
> My plan was to increase the heap size of the child JVM to 512MB.
> But if what the book says is true, there is no point.
I think the intent is to say that when you are using Streaming, the
Child task itself is not really memory intensive, since all the work
is done by the streaming executable. So you can experiment with much
lower heap values than you would for pure Java M/R tasks. I am not
sure what you mean by "streaming does not allow enough memory for
user's processes to run".
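
One way to read the book's advice: the streaming executable runs as a
separate process, outside the child JVM's heap, so each task slot
costs roughly the child heap plus whatever the executable uses. A
minimal sketch of that trade-off follows; the 2048MB left for tasks
and the 300MB per executable are hypothetical numbers, and the
function name is made up for illustration.

```python
def streaming_slots(available_mb, child_heap_mb, executable_mb):
    """Return how many streaming tasks fit when each needs a JVM plus an executable."""
    per_slot_mb = child_heap_mb + executable_mb
    return available_mb // per_slot_mb


AVAILABLE_MB = 2048   # assumed memory left for tasks on the node
EXECUTABLE_MB = 300   # assumed footprint of the streaming executable

# A large child heap leaves less room for the executables...
print(streaming_slots(AVAILABLE_MB, 512, EXECUTABLE_MB))
# ...while a small one frees memory for the process doing the real work.
print(streaming_slots(AVAILABLE_MB, 128, EXECUTABLE_MB))
```

The child heap is what you set through mapred.child.java.opts (the
-Xmx200m you mention); the executable's footprint is something you
would have to measure for your own job.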