java.io.IOException: Task process exit with nonzero status of -1
This is a 4-node Hadoop cluster running on CentOS 6.3 with Oracle JDK (64-bit) 1.6.0_43. Each node has 32G of memory and is configured for a maximum of 8 mapper tasks and 4 reducer tasks. The Hadoop version is 1.0.4.
It is set up on DataStax DSE 3.0.2, which uses Cassandra CFS as the underlying DFS instead of HDFS with a NameNode. I understand that this kind of setup is not well tested with Hadoop MR, but my guess is that the MR error above is not related to it.
I am running a simple MR job that partitions 700G of data (600 files) by DATE. The MR logic is very straightforward, but in the staging environment described above, many reducers failed with this error. I want to find the cause and fix it.
1) There is nothing related to this error in the reducer task attempt log in the user log directory. The only related entry is in system.log, which is generated by the Cassandra process:

    INFO [JVM Runner jvm_201308141528_0003_r_625176200 spawned.] 2013-08-15 07:28:59,326 JvmManager.java (line 510) JVM : jvm_201308141528_0003_r_625176200 exited with exit code -1. Number of tasks it ran: 0
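A minimal sketch for pulling such exit lines out of system.log, assuming every entry has exactly the wording of the line quoted above. The function name `parse_jvm_exit` is hypothetical, and reading `_r_` as "reduce" and `_m_` as "map" is an assumption about the JVM id format.

```python
import re

# Assumes the exact JvmManager wording seen in the quoted system.log line.
JVM_EXIT = re.compile(
    r"JVM : (?P<jvm_id>jvm_\d+_\d+_(?P<kind>[mr])_-?\d+) "
    r"exited with exit code (?P<code>-?\d+)\. "
    r"Number of tasks it ran: (?P<tasks>\d+)"
)

def parse_jvm_exit(line):
    """Return (jvm_id, kind, exit_code, tasks_run), or None if the line does not match."""
    m = JVM_EXIT.search(line)
    if m is None:
        return None
    # Assumption: 'r' in the JVM id marks a reducer JVM, 'm' a mapper JVM.
    kind = "reduce" if m.group("kind") == "r" else "map"
    return m.group("jvm_id"), kind, int(m.group("code")), int(m.group("tasks"))

sample = (
    "INFO [JVM Runner jvm_201308141528_0003_r_625176200 spawned.] "
    "2013-08-15 07:28:59,326 JvmManager.java (line 510) "
    "JVM : jvm_201308141528_0003_r_625176200 exited with exit code -1. "
    "Number of tasks it ran: 0"
)
print(parse_jvm_exit(sample))
```

Running this over the whole file and counting results where the task count is 0 would show how many child JVMs died before running any task at all.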
2) I believe this error is related to system resources, but I could not find anything on Google that points to the root cause. From the log, I believe the reducer task's JVM terminated or crashed, but I don't know why.
3) I checked the limits of the user the process runs under. Here is the output; I didn't spot any obvious problems.

    -bash-4.1$ ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 256589
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 400000
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 32768
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
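The output above comes from an interactive shell, and the limits inherited by the process that actually spawns the task JVMs may not be identical. As a rough sketch, the same values can be read from inside a running process with Python's standard `resource` module, assuming a Linux host. The selection of limits and the helper names `fmt` and `current_limits` are illustrative only.

```python
import resource

# A few limits from the ulimit listing; the choice is illustrative, not exhaustive.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "virtual memory (-v)": resource.RLIMIT_AS,
    "stack size (-s)": resource.RLIMIT_STACK,
}

def fmt(value):
    """Render a raw limit the way ulimit does for the unlimited case."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def current_limits():
    """Return {label: (soft, hard)} for the calling process.

    Note: memory and stack limits are reported in bytes here,
    while `ulimit -a` prints them in kbytes.
    """
    out = {}
    for label, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        out[label] = (fmt(soft), fmt(hard))
    return out

for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={soft} hard={hard}")
```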
4) Since this is a new cluster, very few Hadoop settings have been changed from their defaults. I did run the reducers with '-mx2048m' to set the JVM heap size to 2G, because the first time the reducers failed with an OOM error. From googling around, it looks like people recommend setting "mapred.child.ulimit" to 3x the heap size, which would be around 6G in this case. I can give that a try, but on these nodes virtual memory is already unlimited for the user the process runs under, so I am not sure this will really fix it.
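To make the 3x arithmetic concrete, here is a small sketch. It assumes that "mapred.child.ulimit" is expressed in kilobytes, as the Hadoop 1.x defaults file describes it, and that the heap flag is passed through "mapred.child.java.opts". The helper name `child_ulimit_kb` is hypothetical, and the factor of 3 is only the rule of thumb mentioned above.

```python
def child_ulimit_kb(heap_mb, factor=3):
    """Virtual-memory cap in KB for a JVM heap given in MB (factor is a rule of thumb)."""
    return heap_mb * factor * 1024

heap_mb = 2048  # the 2G heap used for the reducers

# Property names as used in Hadoop 1.x; the values are what this job would need.
props = {
    "mapred.child.java.opts": f"-mx{heap_mb}m",
    "mapred.child.ulimit": str(child_ulimit_kb(heap_mb)),  # assumed unit: KB
}

for name, value in props.items():
    print(f"{name}={value}")
```

With a 2048 MB heap this gives 6291456 KB, that is, 6G.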
5) Another possibility I found on Google is that the child process returns -1 when it fails to write to the user logs, because Linux EXT3 limits how many files/directories can be created under one folder (32k?). But my system uses EXT4, and not many MR jobs have run so far.
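This theory can be checked directly by counting the subdirectories of the user log directory. The sketch below does that on a throwaway directory. The default limit of 32000 is an assumption standing in for the "32k?" figure above, and the names `count_subdirs` and `near_limit` are hypothetical. On a real node the path would be the TaskTracker's user log directory.

```python
import os
import tempfile

def count_subdirs(path):
    """Number of immediate subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))

def near_limit(path, limit=32000, margin=0.9):
    """True if the subdirectory count has reached margin * limit.

    limit=32000 is an assumed per-directory cap, not a measured value.
    """
    return count_subdirs(path) >= limit * margin

# Demo on a temporary directory instead of a real user log directory.
with tempfile.TemporaryDirectory() as root:
    for i in range(5):
        os.mkdir(os.path.join(root, f"attempt_{i}"))
    print(count_subdirs(root), near_limit(root))
```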
6) I am really not sure what the root cause is, as exit code -1 could mean many things. Can anyone here give me more hints, or help with debugging this issue in my environment? Is there any Hadoop or JVM setting I can use to dump more information about why the JVM terminated at runtime with exit code -1?