Pig >> mail # user >> Job setup for a pig run takes ages

Re: Job setup for a pig run takes ages
Hi Markus,
Thanks for reporting the results of the change, and the jstack.
The jstack information is useful. As I suspected, the time is being spent
finding the schema from the load function (which is taking really long
because Avro must be opening each of the large number of files to
determine the schema).
We can also improve things in Pig by making fewer calls to the
LoadFunc's getSchema().
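
A minimal sketch of the "fewer calls" idea, assuming the schema for a given
load location does not change during a single run. SchemaCache, its loader
argument, and loaderCalls() are hypothetical names used only for
illustration; they are not part of Pig or of AvroStorage. The expensive
lookup (here any function from a location to a schema) runs once per
location, and later requests are answered from memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical helper (not a Pig class): remembers the schema already
 * computed for a load location so the expensive lookup runs only once.
 */
final class SchemaCache<S> {
    private final Map<String, S> cache = new HashMap<>();
    private final Function<String, S> loader;
    private int loaderCalls = 0;

    /** loader stands in for the slow, file-reading schema lookup. */
    SchemaCache(Function<String, S> loader) {
        this.loader = loader;
    }

    /** Returns the cached schema, invoking the loader only on a miss. */
    synchronized S getSchema(String location) {
        S schema = cache.get(location);
        if (schema == null) {
            // Cache miss: pay the expensive cost once for this location.
            // A null result is not cached, so it would be looked up again.
            loaderCalls++;
            schema = loader.apply(location);
            if (schema != null) {
                cache.put(location, schema);
            }
        }
        return schema;
    }

    /** Number of times the underlying loader was actually invoked. */
    synchronized int loaderCalls() {
        return loaderCalls;
    }
}
```

With a cache like this, asking for the schema of the same location several
times while a script is being parsed would cost one slow lookup instead of
several. It does not reduce the work done inside a single lookup, so a
location with a very large number of files would still be slow the first
time.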
Thanks,
Thejas
On 6/4/12 10:28 AM, Markus Resch wrote:
> Hi Thejas,
>
> Starting from your assumption, we did some investigation by generating
> some test data in chunks of 500 MByte and ran the script on that, and the
> result was extremely fast.
>
> Thanks for that hint.
>
> Markus
>
> I also did that jstack thing and here is where the thread hangs: (in
> both "lags" btw)
>
>
> "main" prio=10 tid=0x000000005cb1e800 nid=0x545d runnable [0x0000000041e85000]
>     java.lang.Thread.State: RUNNABLE
>          at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>          at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>          at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>          at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>          - locked <0x00000000c2725498> (a sun.nio.ch.Util$2)
>          - locked <0x00000000c2725488> (a java.util.Collections$UnmodifiableSet)
>          - locked <0x00000000c2725260> (a sun.nio.ch.EPollSelectorImpl)
>          at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>          at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>          - locked <0x00000000eb76dec0> (a java.io.BufferedInputStream)
>          at java.io.DataInputStream.readShort(DataInputStream.java:295)
>          at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
>          - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
>          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
>          - locked <0x00000000eb768c20> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
>          at java.io.DataInputStream.read(DataInputStream.java:132)
>          at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.readRaw(BinaryDecoder.java:804)
>          at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:331)
>          at org.apache.avro.io.BinaryDecoder.readFixed(BinaryDecoder.java:287)
>          at org.apache.avro.io.Decoder.readFixed(Decoder.java:143)
>          at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:100)
>          at org.apache.avro.file.DataFileStream.<init>(DataFileStream.java:84)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:217)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:168)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
>          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:297)
>          at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:186)
>          at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:151)
>          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:851)