-
Which strategy is proper to run in this environment?
Jun Young Kim 2011-02-12, 04:07
Hi.
I have a small cluster (9 nodes) running Hadoop.
On this cluster, Hadoop will process thousands of directories sequentially.
Each directory contains two input files for M/R, ranging in size from 1 MB to 5 GB. In summary, each Hadoop job will take one of these directories as its input.
Which strategy would give us the best performance?
Could you suggest one? Which configuration is best?
P.S. Each node has 12 GB of physical memory.
-
Re: Which strategy is proper to run in this environment?
Ted Dunning 2011-02-12, 19:33
This sounds like it will be very inefficient. There is considerable overhead in starting Hadoop jobs. As you describe it, you will be starting thousands of jobs and paying this penalty many times.
Is there a way that you could process all of the directories in one map-reduce job? Can you combine these directories into a single directory with a few large files?
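To make the startup penalty concrete, here is a rough back-of-the-envelope sketch. The job count, the 20-second startup cost, and the batch size of 100 are hypothetical figures chosen only for illustration, not measurements from this cluster; the function name is likewise made up.

```python
def startup_overhead_hours(num_jobs: int, overhead_seconds: float) -> float:
    """Time spent purely on job startup when jobs run one after another."""
    return num_jobs * overhead_seconds / 3600.0


# Hypothetical figures: 5,000 directories and 20 s of startup cost per job.
per_directory = startup_overhead_hours(5000, 20.0)

# Same assumed startup cost, but 100 directories per job, so only 50 jobs.
batched = startup_overhead_hours(5000 // 100, 20.0)

print(f"one job per directory:   {per_directory:.2f} h of startup overhead")
print(f"100 directories per job: {batched:.2f} h of startup overhead")
```

Under these assumptions the overhead shrinks in proportion to the number of jobs, which is why fewer, larger jobs tend to be cheaper.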
-
Re: Which strategy is proper to run in this environment?
Jun Young Kim 2011-02-14, 02:12
Similarly, could I pass all the directories as input to one job at once (without combining them into a single directory)?
Currently, it's not easy to process them all at one time because the directories are generated at quite different times.
But periodically, we can pass many directories as the input for a single Hadoop job.
Anyway, I've tested about 11,000 directories to get M/R outputs.
Total running time: 6 h 25 min. Almost all jobs finish within minutes.
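One possible way to do this periodically is to group whatever directories are ready into batches and hand each batch to one job. Hadoop's FileInputFormat can take several input paths for a single job (for example as a comma-separated string), so the sketch below only builds those strings. The directory names, the batch size, and the helper name are hypothetical, and it assumes the job itself can tell which directory each record came from.

```python
from typing import Iterator, List


def batch_input_paths(directories: List[str], batch_size: int) -> Iterator[str]:
    """Yield one comma-separated input string per batch of directories."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(directories), batch_size):
        yield ",".join(directories[start:start + batch_size])


# Hypothetical directory names; real ones would come from an HDFS listing.
ready_dirs = [f"/data/in/dir{i:05d}" for i in range(5)]

# Each yielded string would be the input of one job instead of five jobs.
for job_input in batch_input_paths(ready_dirs, batch_size=2):
    print(job_input)
```

The batch size is a tuning knob: larger batches mean fewer job startups, at the cost of waiting longer for enough directories to arrive.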
Junyoung Kim ([EMAIL PROTECTED])