Re: Job setup for a pig run takes ages
Hi Markus,

Have you checked the JobTracker at the time of launching the job that Map
slots were available?

Looks like the input dataset size is ~464 GB. Since you mentioned that 10 GB
jobs run fine, there should be no reason a larger dataset would get
stuck, at least not on the Pig side. I can't think of a good reason why the
job does not take off other than the cluster being busy running some
other job.
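
For what it's worth, the 464 reduce tasks line up with Pig's usual
input-size-based estimate from the numbers in your log (this is just the
standard arithmetic, shown for reference):

  reducers = min(maxReducers, ceil(totalInputFileSize / BytesPerReducer))
           = min(999, ceil(463325937621 / 1000000000))
           = 464

so the reducer setup itself looks sane.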

I see that the number of files being processed is large: 50353. That could
be a reason for the slowness, but the ~8 minutes shown in the logs seems to
be on the high end even for that.
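
If the sheer number of small files turns out to be the culprit, it might be
worth checking Pig's split combination so that many small files are packed
into fewer map splits. A rough sketch, assuming a Pig version that supports
it and a loader that allows split combination (the 256 MB figure is only an
example, not a recommendation):

  -- pack many small input files into fewer, larger map splits
  SET pig.splitCombination 'true';
  SET pig.maxCombinedSplitSize 268435456;  -- 256 MB per combined split (example value)

The same properties can also be set in pig.properties or on the command line.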

Maybe also post your script here.

On Thu, May 31, 2012 at 2:38 AM, Markus Resch <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> when we run a Pig job that aggregates some slightly compressed Avro
> data (~160 GByte), it takes ages until the first actual mapred job
> starts:
> 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error messages
> to:
> [...]
> 15:57:27,816 [main] INFO  org.apache.pig.tools.pigstats.ScriptState -
> Pig features used in the script:
> [...]
> 16:07:00,969 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> [...]
> 16:07:30,886 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=463325937621
> [...]
> 16:15:38,022 [Thread-16] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 50353
>
> These log messages are from our test cluster, which has a dedicated
> jobtracker, a dedicated namenode, and 5 data nodes with a map task capacity
> of 15 and a reduce task capacity of 10. There were 6899 map tasks and
> 464 reduce tasks set up.
>
> During the initialisation phase we observed the workload and memory usage
> of the jobtracker, the namenode and some data nodes using top. They were
> mostly bored nearly the whole time (e.g. 30% cpu load on the namenode,
> totally idle on the data nodes). When the jobs were running, most of the
> tasks were in "waiting for IO" most of the time. It seemed some swap space
> was reserved but rarely used during those times.
>
> To us it looks like a Hadoop config issue, but we have no idea what
> exactly it could be. Jobs with about 10 GBytes of input data run
> quite well.
>
> Any hint on where to tweak would be appreciated.
>
> Thanks
> Markus
>
>