This sounds like it will be very inefficient. There is considerable
overhead in starting Hadoop jobs. As you describe it, you will be starting
thousands of jobs and paying this penalty many times.
Is there a way that you could process all of the directories in one
map-reduce job? Can you combine these directories into a single directory
with a few large files?
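One way to run everything as a single job is to hand every directory to one
FileInputFormat: `FileInputFormat.setInputPaths(Job, String)` accepts a
comma-separated list of paths, as does the streaming `-input` option. A minimal
sketch (plain Java, with a hypothetical helper name `joinInputDirs`) that
builds such a list from a root directory, assuming the per-job directories are
its immediate subdirectories:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class InputPathJoiner {
    // Collect every subdirectory of root and join them with commas --
    // the format accepted by Hadoop's FileInputFormat.setInputPaths
    // (hypothetical helper; adjust the filter if some subdirectories
    // should be skipped).
    static String joinInputDirs(Path root) throws IOException {
        List<String> dirs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path p : stream) {
                if (Files.isDirectory(p)) {
                    dirs.add(p.toString());
                }
            }
        }
        dirs.sort(null); // deterministic order across runs
        return String.join(",", dirs);
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // The resulting string can be passed to
        // FileInputFormat.setInputPaths(job, joined) in the driver,
        // or to -input for a streaming job.
        System.out.println(joinInputDirs(root));
    }
}
```

With all directories in one job, you pay the job-startup cost once instead of
thousands of times; the many small files may still produce many map tasks, so
merging them into a few large files (or using CombineFileInputFormat) is worth
considering as well.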
On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kim <[EMAIL PROTECTED]> wrote:
> I have a small cluster (9 nodes) running Hadoop here.
> On this cluster, Hadoop will process thousands of directories sequentially.
> Each directory contains two input files for the m/r job; input file sizes
> range from 1 MB to 5 GB.
> In summary, each Hadoop job will take one of these directories as input.
> To get the best performance, which strategy is appropriate for us?
> Could you advise me on this?
> Which configuration is best?
> P.S.: The physical memory size of each node is 12 GB.