

Re: Job setup for a pig run takes ages
Hi Markus,

Have you checked on the JobTracker whether map slots were available at the
time the job was launched?

Looks like the input dataset size is ~464 GB. Since you mentioned that 10 GB
jobs run fine, there should be no reason a larger dataset gets stuck, at
least not on the Pig side. I can't think of a good reason why the job does
not take off, other than the cluster being busy running some other job.

I see that the number of files being processed is large: 50353. That could
be a reason for the slowness, but the ~8 minutes shown in the logs seems to
be on the high end even for that.
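If most of that setup time is spent listing and splitting those 50353 files,
it may help to let Pig combine small input files into fewer, larger map
splits. A minimal sketch, assuming Pig 0.8+ where these properties exist
(the 256 MB cap is just an example value, and I'm not sure AvroStorage
honors split combination in your version, so treat it as something to test):

-- combine many small input files into fewer, larger map splits
-- (pig.splitCombination defaults to true in recent Pig releases;
--  268435456 bytes = 256 MB is an arbitrary example cap)
SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 268435456;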

Maybe also post your script here.
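As a side note, the reducer count looks normal: Pig estimates reducers as
min(maxReducers, ceil(totalInputFileSize / BytesPerReducer)), which from
your log is min(999, ceil(463325937621 / 1000000000)) = 464, matching the
464 reduce tasks you saw.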

On Thu, May 31, 2012 at 2:38 AM, Markus Resch <[EMAIL PROTECTED]> wrote:

> Hi all,
> when we run a Pig job that aggregates slightly compressed Avro data
> (~160 GByte), it takes ages until the first actual mapred job starts:
> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> to:
> [...]
> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> Pig features used in the script:
> [...]
> 16:07:00,969 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> [...]
> 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> [...]
> 16:15:38,022 [Thread-16] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 50353
> These log messages are from our test cluster, which has a dedicated
> jobtracker and a dedicated namenode plus 5 data nodes, with a map task
> capacity of 15 and a reduce task capacity of 10. The job was set up with
> 6899 map tasks and 464 reduce tasks.
> During the initialisation phase we observed the workload and memory
> usage of the jobtracker, the namenode and some data nodes using top.
> They were nearly idle the whole time (e.g. 30% CPU load on the namenode,
> completely idle data nodes). Once the jobs were running, most of the
> tasks were in "waiting for IO" most of the time. Some swap space seemed
> to be reserved but was rarely used.
> To us this looks like a Hadoop config issue, but we have no idea what
> exactly it could be. Jobs with about 10 GBytes of input data run quite
> well.
> Any hint on where to tweak would be appreciated.
> Thanks
> Markus