Pig >> mail # user >> Job setup for a pig run takes ages


RE: Job setup for a pig run takes ages
We also run into the long setup time issue, but our problem is different:

1. The setup takes about 20 minutes, and we can't see anything on the jobtracker during that time.
2. Our data is stored in flat files, uncompressed.
3. Our code consists of many small pig files, which the main pig file uses in the following way:
data_1 = load ...
data_2 = load ...
...
data_n = load ...

run -param ... pigfile1.pig
run -param ... pigfile2.pig
...

store out1 ..
store out2 ..
...
4. Here's the part of the log file from the setup time. Notice the gap between "13:46:42" and "14:05:23"; during that time, we can't see anything on the jobtracker.
...
2012-06-13 13:46:30,488 [main] INFO  org.apache.pig.Main - Logging error messages to: pig_1339609590477.log
2012-06-13 13:46:30,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:9000
2012-06-13 13:46:30,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master:9001
2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Account position (\\$
2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value K,
2012-06-13 13:46:32,766 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value %)
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_pre. Using value  - matures on
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for detail_statment_post. Using value .
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_fir. Using value : Maturity date
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_sec. Using value  Account position (\\$
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for rationale_thi. Using value K,
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_pre. Using value  matures on
2012-06-13 13:46:32,767 [main] WARN  org.apache.pig.tools.parameters.PreprocessorContext - Warning : Multiple values found for catalyst_post. Using value .
2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: REPLICATED_JOIN,HASH_JOIN,COGROUP,GROUP_BY,ORDER_BY,DISTINCT,STREAMING,FILTER,CROSS,UNION
2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-06-13 14:05:23,460 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10
2012-06-13 14:05:23,474 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_mf: $5, $6, $7, $8, $9, $11, $12, $14, $15, $16, $17, $18, $19, $21, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45
2012-06-13 14:05:23,475 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for starmine: $0, $3, $4, $5, $6, $9, $10, $11
...
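
A pause like the one between "13:46:42" and "14:05:23" can also be located mechanically in a longer log. The following is only a minimal sketch in Python and is not part of Pig: it assumes every relevant line starts with a timestamp in the format shown above, and the names `parse_timestamp` and `largest_gap` are made up for illustration. The two sample lines are copied from the excerpt above.

```python
from datetime import datetime

# Timestamp layout at the start of the Pig client log lines,
# e.g. "2012-06-13 13:46:42,749" (23 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if absent."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause, or None."""
    best = None
    previous = None  # (timestamp, line) of the last timestamped line seen
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # skip "..." and other lines without a timestamp
        if previous is not None:
            gap = ts - previous[0]
            if best is None or gap > best[0]:
                best = (gap, previous[1], line)
        previous = (ts, line)
    return best


# Two consecutive lines taken from the log excerpt.
sample = [
    "2012-06-13 13:46:42,749 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.",
    "...",
    "2012-06-13 14:05:23,460 [main] INFO  org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for var_raw: $0, $1, $2, $6, $7, $8, $9, $10",
]
gap, before, after = largest_gap(sample)
print("Longest pause:", gap)
print("Last line before the pause:", before[:60])
```

The line printed as "last line before the pause" shows what the client logged just before it went quiet, which here is the message about the new logical plan.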

Any help will be appreciated.

Thanks.
Dan

-----Original Message-----
From: Markus Resch [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 13, 2012 2:24 AM
To: [EMAIL PROTECTED]
Subject: Re: Job setup for a pig run takes ages

Hey Alex,

On one side I think you're right, but we need to keep in mind that the schema could change within some files of a glob (e.g. a schema extension update). The Avro Storage should check at least some hash of the schema to verify that the schemas of all input files are the same, and/or split the files into groups of different schemas if required.
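
To illustrate the idea of comparing schema hashes, here is a minimal sketch in Python. It is not how the Avro Storage works; it only shows grouping input files by a hash of their schema. The file names, the schemas, and the function names `schema_fingerprint` and `group_by_schema` are hypothetical, and the normalization (sorted JSON object keys, no whitespace) is a simplification rather than the canonical form defined by the Avro specification.

```python
import hashlib
import json
from collections import defaultdict


def schema_fingerprint(schema_json):
    """Hash a schema after normalizing whitespace and object key order."""
    normalized = json.dumps(
        json.loads(schema_json), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_schema(schemas_by_file):
    """Map each fingerprint to the sorted list of files sharing that schema."""
    groups = defaultdict(list)
    for path, schema_json in schemas_by_file.items():
        groups[schema_fingerprint(schema_json)].append(path)
    return {fp: sorted(paths) for fp, paths in groups.items()}


# Hypothetical input: file names and schemas are invented for illustration.
schemas_by_file = {
    "part-0.avro": '{"type": "record", "name": "Event", "fields": [{"name": "id", "type": "long"}]}',
    "part-1.avro": '{"name": "Event", "type": "record", "fields": [{"type": "long", "name": "id"}]}',
    "part-2.avro": '{"type": "record", "name": "Event", "fields": [{"name": "id", "type": "long"}, {"name": "tag", "type": "string"}]}',
}
groups = group_by_schema(schemas_by_file)
for fingerprint, paths in groups.items():
    print(fingerprint[:12], paths)
```

If all files fall into one group, the schema only has to be processed once; more than one group means the glob mixes schema versions and the files would need to be handled per group.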

I'm currently about to check this issue with the Cloudera CDH4 Pig version. I'll let you know if we get significantly different behavior.

Best
Markus

On Tuesday, 12.06.2012, 19:16 -0400, Alex Rovner wrote:

Markus Resch
Software Developer
P: +49 6103-5715-236 | F: +49 6103-5715-111 | ADTECH GmbH | Robert-Bosch-Str. 32 | 63303 Dreieich | Germany | http://www.adtech.com

ADTECH | A Division of Advertising.com Group - Residence of the Company:
Dreieich, Germany - Registration Office: Offenbach, HRB 46021 Management Board: Erhard Neumann, Mark Thielen

This message contains privileged and confidential information. Any dissemination, distribution, copying or other use of this message or any of its content (data, prices...) to any third parties may only occur with ADTECH's prior consent.