Re: M/R, Strange behavior with multiple Gzip files
I tend to agree with Jean-Marc's observation. If your job client logs
a "LocalJobRunner" at any point, then that is most definitely your
problem.

Otherwise, if you feel you are facing a scheduling problem, the cause is
most likely your scheduler configuration. For example, FairScheduler
pools have a <maxMaps/> attribute you can set to cap the number of map
slots that jobs in a given pool may use in parallel, etc.
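
As a rough illustration of that suggestion, a FairScheduler allocation file with a per-pool map cap could look like this (the pool name and the limits below are hypothetical, not taken from this thread):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Hypothetical pool; jobs submitted to it use at most 40 map slots
       and 10 reduce slots in parallel. -->
  <pool name="bulkimport">
    <maxMaps>40</maxMaps>
    <maxReduces>10</maxReduces>
  </pool>
</allocations>
```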

On Thu, Dec 6, 2012 at 8:10 PM, x6i4uybz labs <[EMAIL PROTECTED]> wrote:
> Hello,
>
> The job isn't running in local mode. In fact, I think I just have a
> problem with the map task progress reporting.
> The counters of each map task look fine during the job execution, whereas
> the reported progress of each map task stays at 0%.
>
>
>
> On Thu, Dec 6, 2012 at 1:34 PM, Jean-Marc Spaggiari
> <[EMAIL PROTECTED]> wrote:
>>
>> Hi,
>>
>> Have you configured mapred-site.xml to say where the JobTracker is? If
>> not, your job is running with the local job runner, which executes the
>> tasks one by one.
>>
>> JM
>>
>> PS: I faced the same issue a few weeks ago and saw the exact same
>> behaviour. The fix above solved it.
>>
>> 2012/12/6, x6i4uybz labs <[EMAIL PROTECTED]>:
>> > Sorry,
>> >
>> > I wrote an M/R job to process several gz files (about 2000). I have an
>> > 80-map-slot cluster.
>> > The JT instantiates one map per gz file (not splittable, that's expected).
>> >
>> > The first 80 maps spawn, but after the "initializing" state it seems
>> > only one map is running. When that map finishes, another one starts
>> > (not 80 maps in parallel) and another is assigned to the empty slot.
>> >
>> > I've also noticed that the first maps last more than one hour and the
>> > last maps 50 seconds.
>> > Each gz file is between 10 MB and 100 MB.
>> >
>> > I don't understand this behavior.
>> > I will launch the job again to see if I get the same issue.
>> >
>> > thanks, gpo
>> >
>> > On Wed, Dec 5, 2012 at 6:33 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> >
>> >> Your problem isn't clear from your description - could you please
>> >> rephrase it in terms of what you are expecting vs. what you are
>> >> observing?
>> >>
>> >> Also note that Gzip files are not splittable by nature of their codec
>> >> algorithm, and hence a TextInputFormat over plain/regular Gzip files
>> >> would end up spawning and/or processing one whole Gzip file via one
>> >> mapper, instead of multiple mappers per file.
>> >>
>> >> On Wed, Dec 5, 2012 at 9:32 PM, x6i4uybz labs
>> >> <[EMAIL PROTECTED]>
>> >> wrote:
>> >> > Hi everybody,
>> >> >
>> >> > I have an M/R job which does a bulk import into HBase.
>> >> > I have to process many gzip files (2800 x ~100 MB).
>> >> >
>> >> > I don't understand why my job instantiates 80 maps but runs each map
>> >> > sequentially, as if there were only one big gz file.
>> >> >
>> >> > Is there a problem in my driver? Or maybe I'm missing something.
>> >> > I use "FileInputFormat.addInputPath(job, new Path(args[0]))" where
>> >> args[0]
>> >> > is a directory.
>> >> >
>> >> > Can you help me, please?
>> >> >
>> >> > Thanks, Guillaume
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >>
>> >
>
>

--
Harsh J
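
A minimal mapred-site.xml along the lines of Jean-Marc's suggestion might look like this (the JobTracker host and port below are placeholders; without this property, mapred.job.tracker defaults to "local" and the LocalJobRunner runs tasks one at a time):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- Placeholder host:port; point this at your real JobTracker. -->
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```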
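
For reference, a driver matching the snippet quoted in the thread could be sketched as follows. The class name and output handling here are hypothetical (only the `FileInputFormat.addInputPath(job, new Path(args[0]))` line is from the thread), and the mapper setup is elided since it was never shown:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipBulkImportDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "gzip-bulk-import");
    job.setJarByClass(GzipBulkImportDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    // args[0] is a directory; since gzip is not splittable, each .gz file
    // inside it becomes exactly one map task.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // job.setMapperClass(...) etc. omitted; not shown in the original thread.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Nothing in such a driver serializes the maps; whether the 80 spawned maps actually run in parallel is decided by the JobTracker configuration and the scheduler, as discussed above.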