Pig >> mail # user >> Job setup for a pig run takes ages


Re: Job setup for a pig run takes ages
Hi Markus,
Thanks for reporting the results of the change, and the jstack.
The jstack output is useful: as I suspected, the time is being spent
fetching the schema from the load function (which takes a really long
time because Avro must be stat'ing each of the large number of files to
determine the schema).
We can also improve things in pig by making fewer calls to the
LoadFunc's getSchema().
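
As a rough illustration of the "fewer calls" idea, here is a minimal sketch
that remembers the schema per load location so the expensive lookup runs
only once. This is not Pig's actual code: the class name CachedSchemaLookup,
the String-based schema, and the lookup function passed to the constructor
are all hypothetical stand-ins for whatever really reads the schema.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: caches one schema per load location. */
public class CachedSchemaLookup {
    // Stand-in for the slow call (e.g. opening files to read a schema).
    private final Function<String, String> expensiveLookup;
    private final Map<String, String> cache = new HashMap<>();
    private int lookups = 0;

    public CachedSchemaLookup(Function<String, String> expensiveLookup) {
        this.expensiveLookup = expensiveLookup;
    }

    /** Returns the cached schema, calling the slow lookup only on a miss. */
    public synchronized String getSchema(String location) {
        String schema = cache.get(location);
        if (schema == null) {
            schema = expensiveLookup.apply(location);
            lookups++;
            cache.put(location, schema);
        }
        return schema;
    }

    /** Number of times the slow lookup was actually invoked. */
    public synchronized int lookupCount() {
        return lookups;
    }

    public static void main(String[] args) {
        CachedSchemaLookup lookup =
            new CachedSchemaLookup(loc -> "schema-for:" + loc);
        lookup.getSchema("/data/events");
        lookup.getSchema("/data/events"); // served from the cache
        System.out.println(lookup.lookupCount()); // prints 1
    }
}
```

Caching by location only helps if the schema for a location does not
change during the run, which is an assumption of this sketch.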
Thanks,
Thejas
On 6/4/12 10:28 AM, Markus Resch wrote:
> Hi Thejas,
>
> Starting from your assumption, we did some investigation by generating
> some test data in chunks of 500 MB and ran the script on that, and the
> result was extremely fast.
>
> Thanks for that hint.
>
> Markus
>
> I also did that jstack thing and here is where the thread hangs: (in
> both "lags" btw)
>
>
> "main" prio=10 tid=0x000000005cb1e800 nid=0x545d runnable [0x0000000041e85000]
>     java.lang.Thread.State: RUNNABLE
>          at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>          at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>          at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>          at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>          - locked <0x00000000c2725498> (a sun.nio.ch.Util$2)
>          - locked <0x00000000c2725488> (a java.util.Collections$UnmodifiableSet)
>          - locked <0x00000000c2725260> (a sun.nio.ch.EPollSelectorImpl)
>          at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>          at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>          - locked <0x00000000eb76dec0> (a java.io.BufferedInputStream)
>          at java.io.DataInputStream.readShort(DataInputStream.java:295)
>          at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
>          - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
>          - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
>          at java.io.DataInputStream.read(DataInputStream.java:132)
>          at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.readRaw(BinaryDecoder.java:804)
>          at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:331)
>          at org.apache.avro.io.BinaryDecoder.readFixed(BinaryDecoder.java:287)
>          at org.apache.avro.io.Decoder.readFixed(Decoder.java:143)
>          at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:100)
>          at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:217)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:168)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:297)
>          at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:186)
>          at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:151)
>          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:851)