process failed - java.lang.OutOfMemoryError
Denis Lowe, 2013-03-01 23:57
We observed the following error:

    01 Mar 2013 21:37:24,807 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSEventSink.process:460) - process failed
    java.lang.OutOfMemoryError
        at org.apache.hadoop.io.compress.zlib.ZlibCompressor.init(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibCompressor.<init>(ZlibCompressor.java:222)
        at org.apache.hadoop.io.compress.GzipCodec$GzipZlibCompressor.<init>(GzipCodec.java:159)
        at org.apache.hadoop.io.compress.GzipCodec.createCompressor(GzipCodec.java:109)
        at org.apache.hadoop.io.compress.GzipCodec.createOutputStream(GzipCodec.java:92)
        at org.apache.flume.sink.hdfs.HDFSCompressedDataStream.open(HDFSCompressedDataStream.java:70)
        at org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:216)
        at org.apache.flume.sink.hdfs.BucketWriter.access$000(BucketWriter.java:53)
        at org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:172)
        at org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:170)
        at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:143)
        at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:170)
        at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:364)
        at org.apache.flume.sink.hdfs.HDFSEventSink$2.call(HDFSEventSink.java:729)
        at org.apache.flume.sink.hdfs.HDFSEventSink$2.call(HDFSEventSink.java:727)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

Unfortunately, the error does not say whether it was caused by a lack of heap, PermGen or direct memory.
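One way to check the heap/PermGen question from inside the process is to ask the JVM's own memory MXBeans. The sketch below (the class name `MemoryReport` is hypothetical) prints total heap, total non-heap and every individual memory pool. As far as I know, memory that native code allocates itself, such as the zlib state set up by `ZlibCompressor.init(Native Method)` at the top of the trace, is not counted in any of these pools, so it would only show up in the process's RES. The fact that the error is thrown from a native method with no message hints at a native allocation failure, but that is an inference, not something the log states.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists what the JVM itself accounts for (total heap,
// total non-heap, and each pool such as eden, old generation and PermGen).
// Memory malloc'd by native libraries is NOT included in any of these numbers.
public class MemoryReport {
    private static final long MB = 1024L * 1024L;

    // Formats one MemoryUsage as "name used=..MB committed=..MB max=.."
    static String line(String name, MemoryUsage u) {
        long max = u.getMax(); // -1 means the pool has no defined maximum
        return String.format("%-35s used=%dMB committed=%dMB max=%s",
                name, u.getUsed() / MB, u.getCommitted() / MB,
                max < 0 ? "undefined" : (max / MB) + "MB");
    }

    static List<String> report() {
        List<String> lines = new ArrayList<String>();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        lines.add(line("heap (total)", mem.getHeapMemoryUsage()));
        lines.add(line("non-heap (total)", mem.getNonHeapMemoryUsage()));
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage(); // null if the pool is no longer valid
            if (u != null) {
                lines.add(line(pool.getType() + " / " + pool.getName(), u));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String s : report()) {
            System.out.println(s);
        }
    }
}
```

If these totals stay far below the process's RES (as the VisualVM numbers above suggest), the missing memory is most likely outside the Java-managed pools.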
Looking at the system memory, we could see that we were using 3GB of 7GB (i.e. less than half of the physical memory). Using the VisualVM profiler, we could see that we had not maxed out the JVM memory areas:

- Heap memory: 75MB used of 131MB allocated
- PermGen: 16MB used of 27MB allocated (fine)
- Direct buffer memory: < 50MB (this gets freed after each GC)
- Mapped buffer memory: 9 buffers, 144MB (always stays constant)

I'm assuming -XX:MaxDirectMemorySize applies to direct buffer memory only, NOT to mapped buffer memory?

The other thing we noticed is that after a restart the Flume process "RES" size starts at around 200MB and then, over a period of a week, grows to 3GB, after which we observed the above error. Unfortunately, we cannot see where this 3GB of memory is being used when profiling with VisualVM and JConsole (max heap size is set to 256MB), so there definitely appears to be a slow memory leak.

Flume is the only process running on this server:

- 64-bit CentOS
- java version "1.6.0_27" (64-bit)

The Flume collector is configured with 8 file channels writing to S3 using the HDFS sink.
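On the -XX:MaxDirectMemorySize question: that flag caps memory reserved through `ByteBuffer.allocateDirect`; it does not cover mapped buffers (`FileChannel.map`) or memory that a native library such as zlib allocates on its own, and neither does -Xmx. The sketch below uses the JDK's own zlib binding (`java.util.zip.Deflater`) as a stand-in for Hadoop's native `ZlibCompressor`, on the assumption that both keep their compression state outside the Java heap. `NativeZlibDemo` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustration only: java.util.zip, not Hadoop's ZlibCompressor.
public class NativeZlibDemo {

    // Each Deflater owns a native zlib stream allocated outside the Java heap.
    // end() frees it immediately; without end() it is freed only when the
    // (small) Deflater object is finalized after some later GC.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native memory now, not at finalization
        }
    }

    // Inverse of compress(), used to check the round trip.
    static byte[] decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("truncated or dictionary-based stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] input = new byte[64 * 1024]; // 64 KB of zeros
        // Many short-lived compressors, each released with end().
        for (int i = 0; i < 10000; i++) {
            compress(input);
        }
        System.out.println("compressed 10000 times");
    }
}
```

Removing the `deflater.end()` call and watching RES in `top` while the heap stays small in VisualVM may be one way to reproduce the "small heap, growing RES" pattern described above.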
(8 upstream servers are pushing events to 2 downstream collectors.) Each of the 8 channels/sinks is configured as follows:

    ## impression source
    agent.sources.impressions.type = avro
    agent.sources.impressions.bind = 0.0.0.0
    agent.sources.impressions.port = 5001
    agent.sources.impressions.channels = impressions-s3-channel

    ## impression channel
    agent.channels.impressions-s3-channel.type = file
    agent.channels.impressions-s3-channel.checkpointDir = /mnt/flume-ng/checkpoint/impressions-s3-channel
    agent.channels.impressions-s3-channel.dataDirs = /mnt/flume-ng/data1/impressions-s3-channel,/mnt/flume-ng/data2/impressions-s3-channel
    agent.channels.impressions-s3-channel.maxFileSize = 210000000
    agent.channels.impressions-s3-channel.capacity = 2000000
    agent.channels.impressions-s3-channel.checkpointInterval = 300000
    agent.channels.impressions-s3-channel.transactionCapacity = 10000

    # impression s3 sink
    agent.sinks.impressions-s3-sink.type = hdfs
    agent.sinks.impressions-s3-sink.channel = impressions-s3-channel
    agent.sinks.impressions-s3-sink.hdfs.path = s3n://KEY:SECRET_KEY@S3-PATH
    agent.sinks.impressions-s3-sink.hdfs.filePrefix = impressions-%{collector-host}
    agent.sinks.impressions-s3-sink.hdfs.callTimeout = 0
    agent.sinks.impressions-s3-sink.hdfs.rollInterval = 3600
    agent.sinks.impressions-s3-sink.hdfs.rollSize = 450000000
    agent.sinks.impressions-s3-sink.hdfs.rollCount = 0
    agent.sinks.impressions-s3-sink.hdfs.codeC = gzip
    agent.sinks.impressions-s3-sink.hdfs.fileType = CompressedStream
    agent.sinks.impressions-s3-sink.hdfs.batchSize = 100

I am using flume-ng 1.3.1 with the following parameters:

    JAVA_OPTS="-Xms64m -Xmx256m -Xss128k -XX:MaxDirectMemorySize=256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc -Xloggc:/mnt/logs/flume-ng/gc.log"

We have 2 collectors running, and they both fail at pretty much the same time.
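The stack trace shows that each `BucketWriter.open` goes through `GzipCodec.createOutputStream` into `createCompressor`, so every new file appears to set up a new gzip compressor. A rough lower bound on how often that happens per collector can be computed from the configuration above. The helper below is hypothetical and assumes events arrive continuously, one bucket per sink, and rolls driven only by `hdfs.rollInterval`; size-based rolls (`hdfs.rollSize`) or several distinct `%{collector-host}` values would only increase the count.

```java
// Hypothetical helper: lower bound on the number of files (and therefore,
// per the stack trace, gzip compressors) opened by one collector.
public class RollEstimate {

    // Assumes continuous traffic, one bucket per sink, and interval-only rolling.
    static long minFileOpens(int sinks, long rollIntervalSeconds, long periodSeconds) {
        if (sinks < 0 || rollIntervalSeconds <= 0 || periodSeconds < 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        return sinks * (periodSeconds / rollIntervalSeconds);
    }

    public static void main(String[] args) {
        long week = 7L * 24 * 3600;
        // 8 sinks with hdfs.rollInterval = 3600, as configured above
        System.out.println(minFileOpens(8, 3600, week)); // 1344
    }
}
```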
So from what I can see, there appears to be a slow memory leak in the HDFS sink, but I have no idea how to track it down or what alternative configuration I could use to prevent this from happening again. Any ideas would be greatly appreciated.
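On the "alternative" side, Hadoop's compression API has a `CodecPool` class (`CodecPool.getCompressor(codec)` / `CodecPool.returnCompressor(compressor)`) and a two-argument `createOutputStream(out, compressor)`, which let a caller reuse compressors rather than create one per stream. The trace above instead goes from `createOutputStream` straight into `createCompressor`. The JDK-only sketch below illustrates the pooling idea with `java.util.zip.Deflater`; `DeflaterPool` is a hypothetical class, not part of Flume or Hadoop, and whether pooling would cure this particular leak is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical pool (same idea as Hadoop's CodecPool): reuse compressors
// instead of allocating a fresh native zlib stream for every file.
public class DeflaterPool {
    private final Deque<Deflater> idle = new ArrayDeque<Deflater>();
    private int created = 0;

    // Returns an idle Deflater, or creates a new one if none is available.
    public synchronized Deflater borrow() {
        Deflater d = idle.pollFirst();
        if (d == null) {
            d = new Deflater(Deflater.DEFAULT_COMPRESSION);
            created++;
        }
        return d;
    }

    // Resets the stream state (keeping the native allocation) for reuse.
    public synchronized void giveBack(Deflater d) {
        d.reset();
        idle.addFirst(d);
    }

    // Frees the native memory of every idle Deflater.
    public synchronized void close() {
        for (Deflater d : idle) {
            d.end();
        }
        idle.clear();
    }

    public synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) throws IOException {
        DeflaterPool pool = new DeflaterPool();
        for (int file = 0; file < 1000; file++) { // simulate 1000 "file rolls"
            Deflater d = pool.borrow();
            try {
                DeflaterOutputStream out =
                        new DeflaterOutputStream(new ByteArrayOutputStream(), d);
                out.write(new byte[1024]);
                out.finish(); // completes the compressed stream without ending d
            } finally {
                pool.giveBack(d);
            }
        }
        System.out.println("native streams created: " + pool.createdCount()); // 1
        pool.close();
    }
}
```

Because each "file" is finished before the next one starts, the loop reuses a single native zlib stream instead of allocating 1000 of them.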