Pig >> mail # user >> Job setup for a pig run takes ages
Re: Job setup for a pig run takes ages
Hey Markus,

I would also be interested in seeing your Pig script. I think there is
some insight to be gained here.

Thanks,
Ashutosh
On Fri, Jun 1, 2012 at 6:34 AM, Markus Resch <[EMAIL PROTECTED]> wrote:

> Hi Prashant, Hi Thejas,
>
> thanks for your very quick answer.
> No, this is not a typo. Those time stamps are true and as I said the
> machines are not very busy during this time.
>
> As this is our test cluster, I am sure I am the only one running jobs
> on it. Another issue we have is that we can currently run only one job
> at a time, but that is not the topic of this request. There is also no
> continuous input stream into that cluster; we just copied a batch of
> data to it some time ago.
> As far as I can tell, the 464 GB of input data you mentioned is the
> uncompressed size of the files, which take up 160 GB compressed. The
> latter is what I get when I run hadoop fs -dus on that folder.
>
> It may also be relevant that we are running Cloudera CDH3 Update 3 on
> our systems.
>
> I suspect this could be due to some Avro schema validation implicitly
> executed by the Avro storage. If so, can this be avoided?
>
> Sadly, I am not able to provide the actual script right now as it
> contains confidential information, but I will try to provide a
> version as soon as possible. In any case, I rather suspect a
> configuration problem in Hadoop or Pig, as the script works fine with
> a smaller amount of input data.
>
> I would ask the Hadoop mailing list if this issue occurred during the
> actual mapred run, but as it occurs even before a single mapred job is
> launched, I suspect the problem is in Pig.
>
> Thanks
> Markus
>
> This is the full log until the main work job starts:
> mapred@ournamenode$ pig OurScript.pig
> 2012-05-30 15:27:21,052 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /tmp/pig_1338384441037.log
> 2012-05-30 15:27:21,368 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://OurNamenode:9000
> 2012-05-30 15:27:21,609 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> dev-jobtracker001.eu-fra.adtech.com:54311
> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -
> Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s).
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: REPLICATED_JOIN,COGROUP,GROUP_BY,FILTER
> 2012-05-30 15:57:27,816 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2012-05-30 16:06:55,304 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for CampaignInfo: $0, $1, $2, $4, $5, $6, $8, $9
> 2012-05-30 16:06:55,308 [main] INFO
> org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned
> for dataImport: $2, $3, $4
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData1:
> Store(SomeOutputFile1.csv:org.apache.pig.builtin.PigStorage) - scope-521
> Operator Key: scope-521)
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData2:
> Store(/SomeOutputFile2.csv:org.apache.pig.builtin.PigStorage) -
> scope-524 Operator Key: scope-524)
> 2012-05-30 16:06:55,441 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:
> OutputData2:
> Store(/SomeOutputFile3.csv:org.apache.pig.builtin.PigStorage) -
> scope-483 Operator Key: scope-483)
> 2012-05-30 16:06:55,453 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
> - File concatenation threshold: 100 optimistic? false
> 2012-05-30 16:06:55,467 [main] INFO
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input
> paths to process : 1
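
To see where the time goes in a log like the one above, it can help to compute the gaps between consecutive timestamps. The following is only a minimal sketch in Python, not part of Pig or Hadoop: the function name `find_gaps`, the 60-second threshold, and the `SAMPLE` list (four lines copied from the log) are assumptions made for illustration. It assumes each log entry starts with a `yyyy-MM-dd HH:mm:ss,SSS` timestamp and treats lines without one as continuations of a wrapped entry.

```python
import re
from datetime import datetime

# Leading timestamp of a log line, e.g. "2012-05-30 15:27:21,052".
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def find_gaps(lines, min_gap_seconds=60.0):
    """Return (gap_seconds, previous_line, next_line) for each large gap.

    Hypothetical helper: the name and the default threshold are
    illustrative assumptions, not anything shipped with Pig.
    """
    gaps = []
    prev_ts = None
    prev_line = None
    for line in lines:
        # Drop mail quoting ("> ") so a quoted log can be pasted in as-is.
        text = line.lstrip("> ").rstrip()
        match = TS_RE.match(text)
        if not match:
            continue  # wrapped continuation line without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(match.group(2)) * 1000)
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((gap, prev_line, text))
        prev_ts, prev_line = ts, text
    return gaps


# Four timestamped lines taken from the log above.
SAMPLE = [
    "> 2012-05-30 15:27:21,609 [main] INFO",
    "> 2012-05-30 15:57:27,814 [main] WARN  org.apache.pig.PigServer -",
    "> 2012-05-30 15:57:27,816 [main] INFO",
    "> 2012-05-30 16:06:55,304 [main] INFO",
]

gaps = find_gaps(SAMPLE)
for gap, before, after in gaps:
    print(f"{gap:8.1f} s between:\n  {before}\n  {after}")
```

On the lines above this reports two gaps: roughly 30 minutes between connecting to the job tracker and the first PigServer warning, and roughly 9.5 minutes before the column-pruning messages. Both fall before any map-reduce job is launched.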