Flume >> mail # user >> HDFS file rolling behaviour


HDFS file rolling behaviour
Hi,

I use two Flume agents:
1. flume_agent1, the source tier (exec source -> file channel -> Avro sink)
2. flume_agent2, the destination tier (Avro source -> file channel -> HDFS sink)
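
For reference, the two-tier topology above could be wired up with properties along these lines; the agent, component, host, and path names here are hypothetical, since the actual configuration was not posted:

```properties
# Tier 1 (flume_agent1): exec source -> file channel -> Avro sink
agent1.sources = tail
agent1.channels = fc1
agent1.sinks = avro-out

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -F /var/log/app.log
agent1.sources.tail.channels = fc1

agent1.channels.fc1.type = file

agent1.sinks.avro-out.type = avro
agent1.sinks.avro-out.hostname = collector-host
agent1.sinks.avro-out.port = 4141
agent1.sinks.avro-out.channel = fc1

# Tier 2 (flume_agent2): Avro source -> file channel -> HDFS sink
agent2.sources = avro-in
agent2.channels = fc2
agent2.sinks = hdfs-out

agent2.sources.avro-in.type = avro
agent2.sources.avro-in.bind = 0.0.0.0
agent2.sources.avro-in.port = 4141
agent2.sources.avro-in.channels = fc2

agent2.channels.fc2.type = file

agent2.sinks.hdfs-out.type = hdfs
agent2.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/events
agent2.sinks.hdfs-out.channel = fc2
```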

I have observed that when the HDFS sink rolls by *file size* or *number of
events*, it creates a lot of simultaneous connections to the source agent's
Avro sink. When rolling by *time interval*, however, it proceeds *one by
one*: it opens one HDFS file, writes to it, and then closes it. I expected
the other rolling modes to behave the same way, i.e. open a file, write x
events to it, then roll it, open another, and so on.
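
The rolling modes in question are selected by the HDFS sink's roll
parameters; a minimal sketch (sink name hypothetical, values illustrative)
is below. Setting a parameter to 0 disables that trigger, so normally only
the trigger you want rolling on is left non-zero:

```properties
# Roll every 300 seconds (time-based); 0 disables
agent2.sinks.hdfs-out.hdfs.rollInterval = 300
# Roll at ~128 MB (size-based); 0 disables
agent2.sinks.hdfs-out.hdfs.rollSize = 134217728
# Roll after 100000 events (count-based); 0 disables
agent2.sinks.hdfs-out.hdfs.rollCount = 100000
```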

In my case, data ingestion works fine with "time"-based rolling, but in the
other cases the behaviour above causes exceptions such as:
-- too many open files
-- timeouts on the file channel, and a few more.

I can increase the values of the parameters involved in these exceptions,
but I don't know what adverse effects that may have.
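
Two sink settings commonly adjusted in this situation are the cap on
simultaneously open HDFS files and the HDFS call timeout; the sink name and
values below are hypothetical, and the parameter defaults are worth
double-checking against the Flume user guide for your version:

```properties
# Cap how many HDFS files the sink keeps open at once
agent2.sinks.hdfs-out.hdfs.maxOpenFiles = 500
# Give slow HDFS open/write/flush/close calls more time (milliseconds)
agent2.sinks.hdfs-out.hdfs.callTimeout = 30000
```

Raising the OS file-descriptor limit (ulimit -n) for the agent process is
another common workaround for "too many open files", though it treats the
symptom rather than the rolling behaviour itself.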

Can somebody shed some light on rolling by file size / number of events?

Regards,
Jagadish