Flume, mail # user - HDFSEventSink Memory Leak Workarounds


Re: HDFSEventSink Memory Leak Workarounds
Connor Woodson 2013-05-21, 21:12
The other property you will want to look at is maxOpenFiles, which caps the
number of files/paths (bucket writers) held in memory at one time.

If you search for the email thread with the subject "hdfs.idleTimeout ,what's
it used for ?" from back in January, you will find a discussion along these
lines. As a quick summary: if rollInterval is not set to 0, you should
avoid using idleTimeout and should instead set maxOpenFiles to a reasonable
number (the default is 500, which is too large; I think that default was
changed for 1.4).
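
For illustration, a sink config following that advice might look like the
sketch below (the sink name and the specific values are hypothetical, not
taken from any particular setup):

```properties
# Rely on time-based rolling instead of idleTimeout (1-hour roll shown here)
agent.sinks.example-hdfs-sink.hdfs.rollInterval = 3600
# Leave hdfs.idleTimeout unset (default 0, i.e. disabled)
# Cap the number of bucket writers kept in memory; 50 is an illustrative value
agent.sinks.example-hdfs-sink.hdfs.maxOpenFiles = 50
```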

- Connor
On Tue, May 21, 2013 at 9:59 AM, Tim Driscoll <[EMAIL PROTECTED]> wrote:

> Hello,
>
> We have a Flume Agent (version 1.3.1) set up using the HDFSEventSink.  We
> were noticing that we were running out of memory after a few days of
> running, and believe we had pinpointed it to an issue with using the
> hdfs.idleTimeout setting.  I believe this is fixed in 1.4 per FLUME-1864.
>
> Our planned workaround was to just remove the idleTimeout setting, which
> worked, but brought up another issue.  Since we are partitioning our data
> by timestamp, at midnight, we rolled over to a new bucket/partition, opened
> new bucket writers, and left the current bucket writers open.  Ideally the
> idleTimeout would clean this up.  So instead of a slow steady leak, we're
> encountering a 100MB leak every day.
>
> Short of upgrading Flume, does anyone know of a configuration workaround
> for this?  Currently we just bumped up the heap memory and I'm having to
> restart our agents every few days, which obviously isn't ideal.
>
> Is anyone else seeing issues like this?  Or how do others use the HDFS
> sink to continuously write large amounts of logs from multiple source
> hosts?  I can get more in-depth about our setup/environment if necessary.
>
> Here's a snippet of one of our 4 HDFS sink configs:
> agent.sinks.rest-xaction-hdfs-sink.type = hdfs
> agent.sinks.rest-xaction-hdfs-sink.channel = rest-xaction-chan
> agent.sinks.rest-xaction-hdfs-sink.hdfs.path = /user/svc-neb/rest_xaction_logs/date=%Y-%m-%d
> agent.sinks.rest-xaction-hdfs-sink.hdfs.rollCount = 0
> agent.sinks.rest-xaction-hdfs-sink.hdfs.rollSize = 0
> agent.sinks.rest-xaction-hdfs-sink.hdfs.rollInterval = 3600
> agent.sinks.rest-xaction-hdfs-sink.hdfs.idleTimeout = 300
> agent.sinks.rest-xaction-hdfs-sink.hdfs.batchSize = 1000
> agent.sinks.rest-xaction-hdfs-sink.hdfs.filePrefix = %{host}
> agent.sinks.rest-xaction-hdfs-sink.hdfs.fileSuffix = .avro
> agent.sinks.rest-xaction-hdfs-sink.hdfs.fileType = DataStream
> agent.sinks.rest-xaction-hdfs-sink.serializer = avro_event
>
> -Tim
>