Flume >> mail # user >> HDFSEventSink Memory Leak Workarounds


Re: HDFSEventSink Memory Leak Workarounds
The other property you will want to look at is maxOpenFiles, which caps the
number of files/paths held open in memory at one time.

If you search for the email thread with subject "hdfs.idleTimeout ,what's
it used for ?" from back in January you will find a discussion along these
lines. As a quick summary: if rollInterval is not set to 0, you should
avoid using idleTimeout and should set maxOpenFiles to a reasonable number
(the default is 500, which is too large; I think that default is changed in
1.4).
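
As a minimal sketch of what that advice looks like in a sink definition (the sink and channel names here are placeholders, and the maxOpenFiles value of 50 is just an illustrative cap, not an official recommendation):

```properties
# Roll files on a time interval instead of relying on idleTimeout
# (avoid idleTimeout entirely when rollInterval is non-zero).
agent.sinks.my-hdfs-sink.type = hdfs
agent.sinks.my-hdfs-sink.channel = my-chan
agent.sinks.my-hdfs-sink.hdfs.rollInterval = 3600

# Cap the number of bucket writers held in memory; the default of 500
# is larger than most setups need.
agent.sinks.my-hdfs-sink.hdfs.maxOpenFiles = 50

# Note: no hdfs.idleTimeout is set.
```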

- Connor
On Tue, May 21, 2013 at 9:59 AM, Tim Driscoll <[EMAIL PROTECTED]> wrote:

> Hello,
>
> We have a Flume Agent (version 1.3.1) set up using the HDFSEventSink.  We
> were noticing that we were running out of memory after a few days of
> running, and believe we had pinpointed it to an issue with using the
> hdfs.idleTimeout setting.  I believe this is fixed in 1.4 per FLUME-1864.
>
> Our planned workaround was to just remove the idleTimeout setting, which
> worked, but brought up another issue.  Since we are partitioning our data
> by timestamp, at midnight, we rolled over to a new bucket/partition, opened
> new bucket writers, and left the current bucket writers open.  Ideally the
> idleTimeout would clean this up.  So instead of a slow steady leak, we're
> encountering a 100MB leak every day.
>
> Short of upgrading Flume, does anyone know of a configuration workaround
> for this?  Currently we just bumped up the heap memory and I'm having to
> restart our agents every few days, which obviously isn't ideal.
>
> Is anyone else seeing issues like this?  Or how do others use the HDFS
> sink to continuously write large amounts of logs from multiple source
> hosts?  I can get more in-depth about our setup/environment if necessary.
>
> Here's a snippet of one of our 4 HDFS sink configs:
> agent.sinks.rest-xaction-hdfs-sink.type = hdfs
> agent.sinks.rest-xaction-hdfs-sink.channel = rest-xaction-chan
> agent.sinks.rest-xaction-hdfs-sink.hdfs.path = /user/svc-neb/rest_xaction_logs/date=%Y-%m-%d
> agent.sinks.rest-xaction-hdfs-sink.hdfs.rollCount = 0
> agent.sinks.rest-xaction-hdfs-sink.hdfs.rollSize = 0
> agent.sinks.rest-xaction-hdfs-sink.hdfs.rollInterval = 3600
> agent.sinks.rest-xaction-hdfs-sink.hdfs.idleTimeout = 300
> agent.sinks.rest-xaction-hdfs-sink.hdfs.batchSize = 1000
> agent.sinks.rest-xaction-hdfs-sink.hdfs.filePrefix = %{host}
> agent.sinks.rest-xaction-hdfs-sink.hdfs.fileSuffix = .avro
> agent.sinks.rest-xaction-hdfs-sink.hdfs.fileType = DataStream
> agent.sinks.rest-xaction-hdfs-sink.serializer = avro_event
>
> -Tim
>