Re: Running 145K maps, zero reduces- does Hadoop scale?
Simulation trials. Let N be the number of trials and T the number of
map tasks (== splits).
Also assume there is a lot of variation in the running time per trial.
If there are K = N/T trials per task (assume K is an integer) and
K > 1, it is possible that two long-running trials end up in the same
task. Thus even if all the other tasks have completed, one mapper will
still be processing the K trials (belonging to that task) even though
there are free mappers (JVMs) available.
Speculative execution won't split this task and run the remaining
trials across other mappers.

If K == 1, then every map runs exactly one trial, which is basically a
single queue with multiple servers.
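
For illustration, here is a minimal sketch of how K == 1 could be
forced at the InputFormat level by ignoring the framework's numSplits
hint. This is an assumption-laden sketch against the old mapred API;
MySplit is the range-holding split class used in the code quoted at
the bottom of this thread, and 145000 is just this thread's example N.

    // Hypothetical sketch: force K == 1 by emitting exactly one
    // InputSplit per trial, regardless of the numSplits hint.
    public InputSplit[] getSplits(JobConf job, int numSplits)
            throws IOException {
        long n = 145000L; // total number of trials (N)
        InputSplit[] splits = new InputSplit[(int) n];
        for (int i = 0; i < n; i++) {
            // each split covers exactly one trial: [i, i + 1)
            splits[i] = new MySplit(i, i + 1);
        }
        return splits;
    }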

This inputformat works if, say, I set the splits to 145K/2 (K = 2
trials per task); e.g. the running time on 50 machines was 12 minutes.
If however I set T = 4700 (K ~ 30 trials per task), the running time
increases to 20 minutes.
So ideally I want K == 1, which means for, say, 1MM tasks I will have
1MM inputsplits.
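
For reference, in the old mapred API the numSplits value passed to
getSplits() comes from the job's map-task hint, so a driver aiming for
K == 1 might look like the sketch below. TrialDriver, MyInputFormat,
and MyMapper are hypothetical stand-ins for the classes discussed in
this thread.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TrialDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(TrialDriver.class);
            conf.setJobName("simulation-trials");
            conf.setInputFormat(MyInputFormat.class);   // hypothetical custom format
            conf.setMapperClass(MyMapper.class);        // hypothetical mapper
            FileOutputFormat.setOutputPath(conf, new Path(args[0]));
            conf.setNumMapTasks(1000000); // hint only; getSplits() receives it as numSplits
            conf.setNumReduceTasks(0);    // zero reduces: map output goes straight to HDFS
            JobClient.runJob(conf);
        }
    }

Note that setNumMapTasks() is only a hint; the InputFormat's
getSplits() ultimately decides how many map tasks actually run.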

I understand this might not be what Hadoop was designed for; however,
I'd like to get this confirmed, i.e. maybe my inputformat is
incorrectly designed.

Regards
Saptarshi
On Fri, Jul 31, 2009 at 7:51 AM, Amogh Vasekar <[EMAIL PROTECTED]> wrote:

> What is the use case for this? Especially since you have 0 reducers.
>
> Thanks,
> Amogh
>
> -----Original Message-----
> From: Saptarshi Guha [mailto:[EMAIL PROTECTED]]
> Sent: Friday, July 31, 2009 12:08 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Running 145K maps, zero reduces- does Hadoop scale?
>
> In this particular example, the record reader emits a single number per
> split as both key and value.
> Regards
> S
>
> On Fri, Jul 31, 2009 at 1:55 AM, Saptarshi Guha <[EMAIL PROTECTED]
> >wrote:
>
> > Hello,
> > Does Hadoop scale well for 100K+ input splits?
> > I have not tried with sequence files. My custom inputformat generates
> > 145K splits.
> > The record reader emits about 15 bytes as key and 8 bytes as value.
> > It doesn't do anything else; in fact it doesn't read from disk
> > (basically it emits splitbeginning ... splitend for every split).
> > So essentially, my inputformat is creating 145K InputSplit objects
> > (see below).
> >
> > However, I got this:
> > 09/07/31 01:41:41 INFO mapred.JobClient: Running job:
> job_200907251335_0005
> > 09/07/31 01:41:42 INFO mapred.JobClient:  map 0% reduce 0%
> > 09/07/31 01:43:06 INFO mapred.JobClient: Job complete:
> > job_200907251335_0005
> > And the job does not end! Hangs here.
> >
> > Very strange. The jobtracker does not respond to web requests.
> > This is on Hadoop 0.20, though I am using the 0.19.1 API.
> > The master is 64-bit with 4 cores and 16GB RAM, and is not running
> > any tasktrackers.
> >
> > Any pointers would be appreciated
> >
> > Regards
> > Saptarshi
> >
> >
> >     // Basically FileInputSplit reworded
> >     public InputSplit[] getSplits(JobConf job, int numSplits)
> >             throws IOException {
> >         long n = the_length_of_something; // == 145K
> >         long chunkSize = n / (numSplits == 0 ? 1 : numSplits);
> >         InputSplit[] splits = new InputSplit[numSplits];
> >         for (int i = 0; i < numSplits; i++) {
> >             MySplit split;
> >             if ((i + 1) == numSplits)
> >                 // last split absorbs the remainder, up to n
> >                 split = new MySplit(i * chunkSize, n);
> >             else
> >                 split = new MySplit(i * chunkSize, (i * chunkSize) + chunkSize);
> >             splits[i] = split;
> >         }
> >         return splits;
> >     }
> >
> >
> >
>
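
The MySplit class itself is not shown anywhere in this thread. A
minimal sketch of what such a range-holding split might look like,
with every detail assumed:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;

    // Hypothetical sketch: a split that carries only a [begin, end)
    // range of trial indices and has no data locality.
    public class MySplit implements InputSplit {
        private long begin;
        private long end;

        public MySplit() {} // no-arg constructor required for deserialization

        public MySplit(long begin, long end) {
            this.begin = begin;
            this.end = end;
        }

        public long getBegin() { return begin; }

        public long getEnd() { return end; }

        public long getLength() {
            return end - begin; // "length" in trials, not bytes
        }

        public String[] getLocations() {
            return new String[0]; // no preferred hosts: any tasktracker will do
        }

        public void write(DataOutput out) throws IOException {
            out.writeLong(begin);
            out.writeLong(end);
        }

        public void readFields(DataInput in) throws IOException {
            begin = in.readLong();
            end = in.readLong();
        }
    }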