|
Kasi Subrahmanyam
2012-07-04, 12:02
Robert Evans
2012-07-06, 17:00
Phani
2012-07-07, 07:24
Manoj Babu
2012-07-09, 17:57
Karthik Kambatla
2012-07-09, 19:02
Manoj Babu
2012-07-10, 06:57
Karthik Kambatla
2012-07-10, 08:39
|
-
issue with map running time
Kasi Subrahmanyam 2012-07-04, 12:02
Hi,
I have a job with, let us say, 10 mappers running in parallel. Some run fast, but a few take far too long. For example, a few mappers take 5 to 10 minutes, while others take around 12 hours or more. Can a difference in the data handled by the mappers cause such variation, or is it a connectivity issue?
Note: The cluster we are using has multiple users running their jobs on it.
Thanks in advance.
Subbu
-
Re: issue with map running time
Robert Evans 2012-07-06, 17:00
How long a program takes to run depends on a lot of things. It could be a connectivity issue, or your program may do a lot more processing for some input records than for others, or some of your records may be a lot smaller, so that more of them exist in a single input split. Without knowing what the code is doing, it is hard to say more than that.
--Bobby Evans
-
Re: issue with map running time
Phani 2012-07-07, 07:24
Other users might have consumed all the map slots, which may have caused long wait times for some mappers in your job. In such cases I would watch the queues closely and reconsider how the jobs are distributed, moving them to grid queues with sufficient map slots.
Thanks,
Phani
-
Re: issue with map running time
Manoj Babu 2012-07-09, 17:57
Hi Bobby,
I have faced a similar issue. In my job the block size is 64 MB, 656 files were uploaded to HDFS, each 11 MB in size, and 656 map tasks were created. I assume that small files cannot be grouped into a single split.
Could you kindly clarify?
Cheers!
Manoj.
-
Re: issue with map running time
Karthik Kambatla 2012-07-09, 19:02
Hi Manoj,
It seems like a different issue.
Let me understand your case better. Is your input 656 files of 11 MB each? In that case, MapReduce does create 656 map tasks. In general, an input split is the data read from a single file, limited to the block size (64 MB in your case). As the files are smaller than 64 MB, each file forms a separate split.
Hope that helps.
Karthik
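The split arithmetic described in this reply can be sketched in plain Java. This is only a simplified model, not the actual FileInputFormat logic: it assumes the split size equals the block size, that a split never spans two files, and that a zero-length file still counts as one split. The class and method names (SplitCountSketch, splitsForFile, totalSplits) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

// Simplified estimate of how many input splits (and hence map tasks) a set of files yields.
// Assumptions (not taken from the thread): split size == block size, min/max split size
// settings are ignored, and an empty file is counted as one split.
final class SplitCountSketch {

    // Number of block-sized chunks needed to cover one file (ceiling division, at least 1).
    static long splitsForFile(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        long chunks = (fileBytes + blockBytes - 1) / blockBytes;
        return Math.max(1, chunks);
    }

    // Splits are counted per file because a split never crosses a file boundary in this model.
    static long totalSplits(long[] fileSizes, long blockBytes) {
        long total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, blockBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // The case from the thread: 656 files of 11 MB each, 64 MB blocks.
        long[] smallFiles = new long[656];
        Arrays.fill(smallFiles, 11 * mb);
        System.out.println(totalSplits(smallFiles, 64 * mb)); // prints 656

        // The same 7216 MB stored as one file needs far fewer splits.
        System.out.println(totalSplits(new long[] {656 * 11 * mb}, 64 * mb)); // prints 113
    }
}
```

Under these assumptions the 656 small files give 656 map tasks, while the same volume of data in one large file would give 113, which is the motivation for merging small inputs.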
-
Re: issue with map running time
Manoj Babu 2012-07-10, 06:57
Thanks Karthik. But how can we overcome that? Do we need to use a different file format?
Also, I am using the code below to merge all the files into a single file. Is this a proper way to do it?

    // List the local input files and open a single output file on HDFS.
    FileStatus[] inputFiles = local.listStatus(inputDir);
    FSDataOutputStream out = hdfs.create(hdfsFile);
    for (int i = 0; i < inputFiles.length; i++) {
        System.out.println(inputFiles[i].getPath().getName());
        FSDataInputStream in = local.open(inputFiles[i].getPath());
        byte[] buffer = new byte[256];
        int bytesRead = 0;
        // Append this file's bytes to the merged output.
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
        }
        in.close();
    }
    out.close();

Cheers!
Manoj.
-
Re: issue with map running time
Karthik Kambatla 2012-07-10, 08:39
Manoj,
Running an MR job on many small files does incur the latency cost of reading individual files. These costs can be addressed to some extent by re-using the JVM across map tasks (see mapred.job.reuse.jvm.num.tasks).
When your data is already distributed across small files and several machines:
1. It is probably best to use it as is if you are running only one (or very few) MR job(s).
2. Otherwise, it would probably make sense to write an MR job to copy the many small files into a few big files.
Your code for the data copy seems about right, at least at first glance.
Thanks
Karthik
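For readers who want a variant of the merge loop that closes its streams even when a read fails, here is a minimal sketch using only the Java standard library. It is not the code from the thread: the HDFS output stream is replaced by a generic OutputStream (in the Hadoop setting, the stream returned by hdfs.create(hdfsFile) could be passed in instead), the 64 KB buffer size is an arbitrary choice, and the names MergeFilesSketch and mergeInto are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

final class MergeFilesSketch {

    // Concatenates every regular file directly under inputDir into out, in path order.
    // Returns the number of bytes copied. The caller owns 'out' and is responsible for closing it.
    static long mergeInto(Path inputDir, OutputStream out) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> listing = Files.list(inputDir)) {
            listing.filter(Files::isRegularFile).forEach(files::add);
        }
        Collections.sort(files); // fixed order so repeated runs produce the same output

        byte[] buffer = new byte[64 * 1024]; // assumed size; larger than 256 bytes to cut per-call overhead
        long total = 0;
        for (Path file : files) {
            // try-with-resources closes the input even if read or write throws
            try (InputStream in = Files.newInputStream(file)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge-sketch");
        Files.write(dir.resolve("a.txt"), "one\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.txt"), "two\n".getBytes(StandardCharsets.UTF_8));

        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        long copied = mergeInto(dir, merged);
        System.out.println(copied + " bytes merged");
    }
}
```

Plain byte concatenation like this only yields valid input when the file format tolerates it, for example line-oriented text where every file ends with a newline; other formats would likely need a format-aware merge.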