Pig >> mail # user >> Pig hanging just before generate jar


Re: Pig hanging just before generate jar
Can you try loading the input files without the schema?

raw = LOAD '$log_path' using PigStorage('\t', '-noschema');

PigStorage by default looks for schema files and that *may* be slowing down
things (based on your assessment of slowness due to the # of input dirs).
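
If you still want named, typed fields after turning off schema-file lookup, one option is to declare them inline with an AS clause. This is only a sketch: the field names and types below are hypothetical placeholders, not taken from your script, so substitute your real columns.

```pig
-- '-noschema' tells PigStorage not to read .pig_schema files from the input dirs.
-- Assumption: ts, host and message are made-up field names for illustration only.
raw = LOAD '$log_path' USING PigStorage('\t', '-noschema')
      AS (ts:chararray, host:chararray, message:chararray);
```

If the time before the job shows up in the JobTracker drops noticeably with this change, the schema lookups were probably a contributor; if not, the cause is likely elsewhere.
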
On Mon, Jun 3, 2013 at 12:59 PM, Eugene Morozov <[EMAIL PROTECTED]> wrote:

> Hello!
>
>
> Question #1
> I noticed a couple of days ago that my scripts started running slower than
> usual. I experimented a bit, and it turns out that "compilation" time
> depends on how many input files I give to my script. By compilation I mean
> everything that happens after Pig is launched and before I see a new job in
> the JobTracker web UI.
>
> I have 3600 input files that live in 24 different folders named 00 to
> 23. The time Pig takes from launching pig -p
> input_path=... my-script.pig up to the generate-jar step depends on how
> many input files the script has to process. When I give it just one
> directory like 00/*, it takes only 10-20 seconds before the job starts. When I
> use a bunch of directories as a param, 0?/*, it takes about 120-240
> seconds. And it takes a tremendous 15 minutes when I use all my data.
>
> During that period, when it hangs and seems to be doing nothing, I ran
> java/bin/jstack and strace, and I see that there are only two active
> threads:
> * FIRST
>         epoll_wait(291, {}, 1024, 0)            = 0
>         read(287, "\6\10\327\205\25\20\0\0\0\0;\n9\10\2\22\0\30\254\264\264'\"\3\10\244\3*\7per"..., 8192) = 70
>         futex(0x4907b534, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4907b530, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
>         futex(0x47d48a28, FUTEX_WAKE_PRIVATE, 1) = 1
>         clock_gettime(CLOCK_REALTIME, {1370272461, 649119000}) = 0
>         futex(0x4907b340, FUTEX_WAKE_PRIVATE, 1) = 1
>         futex(0x4907b344, FUTEX_WAIT_PRIVATE, 689631, {9, 998984000}) = -1 EAGAIN (Resource temporarily unavailable)
>         futex(0x48c25928, FUTEX_WAKE_PRIVATE, 1) = 0
>         read(287, 0x2aaab1111000, 8192)         = -1 EAGAIN (Resource temporarily unavailable)
>         #287 is just a socket
>
> its java stack is
> "IPC Client (2138196637) connection to hbase01.303net.pvt/10.0.240.16:8020 from emorozov" daemon prio=10 tid=0x00002aaab108c000 nid=0x711 runnable [0x0000000042ed9000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>     - locked <0x00000000c1aab558> (a sun.nio.ch.Util$2)
>     - locked <0x00000000c1aab548> (a java.util.Collections$UnmodifiableSet)
>     - locked <0x00000000c1aa4578> (a sun.nio.ch.EPollSelectorImpl)
>     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:336)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:158)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:154)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:127)
>     at java.io.FilterInputStream.read(FilterInputStream.java:116)
>     at java.io.FilterInputStream.read(FilterInputStream.java:116)
>     at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:386)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     - locked <0x00000000c1800600> (a java.io.BufferedInputStream)
>     at java.io.FilterInputStream.read(FilterInputStream.java:66)
>     at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:276)
>     at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)
>     at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)