Pig >> mail # user >> Job setup for a pig run takes ages


Re: Job setup for a pig run takes ages
Can you run jstack <pid> on the pig client process a few times while it
is taking so long, and send the results? With that we should be able to
easily identify what is happening and suggest a fix.
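
In case it helps, here is a rough sketch of how the dumps could be
collected. It assumes jstack from the JDK is on the PATH; the function
name sample_stacks, the JSTACK override and the output file names are
made up for illustration and are not part of pig or the JDK.

```shell
#!/bin/sh
# Take several thread dumps of one JVM so they can be compared.
# JSTACK can be overridden (assumption: jstack from the JDK is on PATH).
JSTACK="${JSTACK:-jstack}"

# sample_stacks PID [COUNT] [INTERVAL_SECONDS] [OUTPUT_DIR]
sample_stacks() {
    pid="$1"
    count="${2:-5}"
    interval="${3:-30}"
    outdir="${4:-.}"
    i=1
    while [ "$i" -le "$count" ]; do
        # One file per dump; the name carries the sample number.
        "$JSTACK" "$pid" > "$outdir/jstack.$pid.$i.txt" 2>&1
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, sample_stacks <pid> 5 60 would write five dumps one minute
apart into the current directory. If the same frames show up at the top
of the main thread in most of the dumps, that is probably where the
time goes.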

I think pig calls getSchema() on load functions more times than actually
necessary during the query optimization phase (each time after a
transform). One theory I have is that your avro load function does a
stat on all the 50k files each time to determine the common schema. This
might be causing the delay in generating the query plan.

The second large lag, before the job is queued, seems to be time spent
in input split calculations, not within pig logic.

These are just theories I have. The jstack output would help determine
if this is actually the case.

Also, I think this delay is likely to be caused by the number of input
files, and not the actual data size. You might also want to look at ways
to reduce the number of input files. Are they generated by a
previous MR job? If yes, given your cluster size, it works against you
to have so many maps or reducers in that job. If pig is
generating the data, you should consider setting the parallelism
(reducers) or the number of maps (see
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files).
While reducing the number of input files will most likely solve your
problem, I am still interested in knowing where the delays are coming
from and if we can fix something in pig to improve the situation.
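
To check whether the file count rather than the data size is the likely
culprit, something like the following could summarize an input
directory. It is only a sketch: it assumes the usual column layout of
hadoop fs -count (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME), and the
function name summarize_count is made up for illustration.

```shell
#!/bin/sh
# summarize_count reads one line of `hadoop fs -count` output on stdin
# (assumed columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME) and
# prints the file count and the average file size in MB.
summarize_count() {
    awk '{
        files = $2; bytes = $3
        # Guard against an empty directory to avoid dividing by zero.
        avg = (files > 0) ? bytes / files / 1048576 : 0
        printf "files=%d avg_mb=%.1f\n", files, avg
    }'
}
```

It would be used as: hadoop fs -count /path/to/input | summarize_count
(the path is a placeholder). For an input line such as
"1 50000 171798691840 /path/to/input", i.e. 50k files holding 160 GB,
it prints files=50000 avg_mb=3.3. An average that small would point at
the number of files rather than the data volume.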

Thanks,
Thejas
On 6/1/12 6:34 AM, Markus Resch wrote:
> Hi Prashant, Hi Thejas,
>
> thanks for your very quick answer.
> No, this is not a typo. Those time stamps are true and as I said the
> machines are not very busy during this time.
>
> As this is our test cluster I am sure I am the only one who is running
> jobs on it. Another issue we have is that we are currently only able to
> run one job at a time but this shouldn't be the topic of this request.
> We even have no continuous input stream to that cluster but copied a
> bunch of data to it some time ago.
> From my perspective the 464 GB of input data you mentioned is the
> uncompressed amount of the 160 GB of compressed files, which I get when I
> use hadoop fs -dus on that folder.
>
> Another interesting fact for you could be that we're running the
> cloudera CDH3 Update 3 version on our systems.
>
> I suspect this could be due to some fancy avro schema validation
> implicitly executed by the avro storage? If so, can this be avoided?
>
> Sadly I'm currently not able to provide you the actual script
> as it contains confidential information but I will try to provide you a
> version as soon as possible. But I'd rather suspect a configuration
> problem in hadoop or pig anyway, as the script works fine with a
> smaller amount of input data.
>
> I would ask the hadoop mailing list if this issue occurred during the
> actual mapred run, but as it occurs even before a single mapred job is
> launched I suspect pig to have a problem.
>
> Thanks
> Markus
>
> This is the full log until the main work job starts:
> mapred@ournamenode$ pig OurScript.pig
> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /tmp/pig_1338384441037.log
> 2012-05-30 15:27:21,368 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://OurNamenode:9000
> 2012-05-30 15:27:21,609 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> dev-jobtracker001.eu-fra.adtech.com:54311
> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -