Pig >> mail # user >> Pig hanging just before generate jar


Pig hanging just before generate jar
Hello!
Question #1
A couple of days ago I noticed that my scripts had started running slower than
usual. I experimented a bit, and it turns out that the "compilation" time
depends on how many input files I give to my script. By compilation I mean
everything Pig does from the moment it is launched until I see a new job in
the JobTracker web UI.

I have 3600 input files that live in 24 different folders named 00 to
23. The time Pig takes from launching pig -p input_path=... my-script.pig
up to the jar generation step depends on how many input files the script
has to process. When I give it just one directory, like 00/*, it takes only
10-20 seconds before the job starts. When I pass a group of directories,
0?/*, it takes about 120-240 seconds. And it takes a tremendous 15 minutes
when I use all my data.
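
As a rough sanity check, the timings above can be turned into a per-file startup cost. This is only a back-of-the-envelope sketch in Python: it assumes the 3600 files are spread evenly over the 24 folders (150 per folder), which the numbers above do not confirm, and the names `OBSERVATIONS`, `per_file_cost` and `summarize` are made up for illustration.

```python
# Rough per-file startup cost derived from the timings quoted above.
# Assumption (not confirmed): the 3600 files are spread evenly over the 24 folders.
TOTAL_FILES = 3600
TOTAL_DIRS = 24
FILES_PER_DIR = TOTAL_FILES // TOTAL_DIRS  # 150 under the even-spread assumption

# (input glob, directories matched, observed startup time range in seconds)
OBSERVATIONS = [
    ("00/*", 1, (10, 20)),
    ("0?/*", 10, (120, 240)),      # 0? matches folders 00 through 09
    ("all data", 24, (900, 900)),  # "15 minutes"
]


def per_file_cost(dirs, seconds):
    """Return (low, high) seconds of startup time per input file."""
    files = dirs * FILES_PER_DIR
    low, high = seconds
    return low / files, high / files


def summarize():
    """Return one (glob, file count, low, high) row per observation."""
    rows = []
    for glob, dirs, seconds in OBSERVATIONS:
        low, high = per_file_cost(dirs, seconds)
        rows.append((glob, dirs * FILES_PER_DIR, low, high))
    return rows


if __name__ == "__main__":
    for glob, files, low, high in summarize():
        print(f"{glob:>8}: {files:4d} files, {low:.3f}-{high:.3f} s per file")
```

Under that assumption the startup cost comes out somewhere between roughly 0.07 and 0.25 seconds per input file, so the delay grows at least in proportion to the number of files rather than being a fixed overhead.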

During that period, when Pig appears to hang and do nothing, I ran
java/bin/jstack and strace and saw that only two threads are
active:
* FIRST
        epoll_wait(291, {}, 1024, 0)            = 0
        read(287, "\6\10\327\205\25\20\0\0\0\0;\n9\10\2\22\0\30\254\264\264'\"\3\10\244\3*\7per"..., 8192) = 70
        futex(0x4907b534, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4907b530, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
        futex(0x47d48a28, FUTEX_WAKE_PRIVATE, 1) = 1
        clock_gettime(CLOCK_REALTIME, {1370272461, 649119000}) = 0
        futex(0x4907b340, FUTEX_WAKE_PRIVATE, 1) = 1
        futex(0x4907b344, FUTEX_WAIT_PRIVATE, 689631, {9, 998984000}) = -1 EAGAIN (Resource temporarily unavailable)
        futex(0x48c25928, FUTEX_WAKE_PRIVATE, 1) = 0
        read(287, 0x2aaab1111000, 8192)         = -1 EAGAIN (Resource temporarily unavailable)
        #287 is just a socket

Its Java stack is:
"IPC Client (2138196637) connection to hbase01.303net.pvt/10.0.240.16:8020 from emorozov" daemon prio=10 tid=0x00002aaab108c000 nid=0x711 runnable [0x0000000042ed9000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x00000000c1aab558> (a sun.nio.ch.Util$2)
        - locked <0x00000000c1aab548> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000c1aa4578> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:158)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:154)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:127)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:386)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        - locked <0x00000000c1800600> (a java.io.BufferedInputStream)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:276)
        at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)
        at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)
        at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752)
        at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcPayloadHeaderProtos.java:985)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:882)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)

* SECOND
        futex(0x4dd23a28, FUTEX_WAKE_PRIVATE, 1) = 0
        futex(0x4e0e9f94, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4e0e9f90, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
        futex(0x2aaab105cf28, FUTEX_WAKE_PRIVATE, 1) = 1
        write(287, "\0\0\0\306\10\10\2\20\0\30\256\265j\273\1\n\vgetFileInfo\22z\nx"..., 202) = 202
        #287 is the same socket

"main" prio=10 tid=0x000000004dd22800 nid=0x6fd in Object.wait() [0x0000000041fc9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:1146)
        - locked <0x00000000eda48e00> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy9.getFileInfo(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy9.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:628)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1507)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:783)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1257)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:147)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:153)
        at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:131)
        at org.apache.pig.builtin.JsonM
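
To check which threads are actually doing something, a thread dump like the ones above can be reduced to one line per thread. The following is a minimal Python sketch, not a polished tool: `summarize_dump` is a hypothetical helper name, `SAMPLE` is an abbreviated copy of the "main" thread shown above, and the parser relies only on the header and `java.lang.Thread.State:` layout visible in these dumps. It expects unwrapped jstack output, where each frame sits on its own `at ...` line.

```python
import re

# A thread header starts with the quoted thread name, e.g. "main" prio=10 ...
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE = re.compile(r'java\.lang\.Thread\.State:\s+(?P<state>\w+)')
FRAME = re.compile(r'^\s*at\s+(?P<frame>.+?)\s*$')

# Abbreviated from the "main" thread dump above.
SAMPLE = '''"main" prio=10 tid=0x000000004dd22800 nid=0x6fd in Object.wait() [0x0000000041fc9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client.call(Client.java:1146)
'''


def summarize_dump(text):
    """Return a list of (thread name, state, top frame) tuples."""
    threads = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # New thread section: remember it and fill in details below.
            current = {"name": header.group("name"), "state": None, "top": None}
            threads.append(current)
            continue
        if current is None:
            continue  # preamble before the first thread header
        state = STATE.search(line)
        if state and current["state"] is None:
            current["state"] = state.group("state")
            continue
        frame = FRAME.match(line)
        if frame and current["top"] is None:
            current["top"] = frame.group("frame")  # keep only the topmost frame
    return [(t["name"], t["state"], t["top"]) for t in threads]


if __name__ == "__main__":
    for name, state, top in summarize_dump(SAMPLE):
        print(f"{state or '-':<14} {name}: {top or '-'}")
```

Running the same function over several dumps taken a few seconds apart would show whether the "main" thread keeps waiting on a getFileInfo call each time, which is what the two dumps above suggest.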
Prashant Kommireddi 2013-06-03, 21:07
Eugene Morozov 2013-06-04, 06:48