Flume >> mail # user >> File channel performance on a single disk is poor


Juhani Connolly 2012-07-04, 07:07
Re: File channel performance on a single disk is poor
Hi Juhani,
A while ago I filed JIRA FLUME-1227, in which I suggested creating some sort of SpillableChannel that would behave similarly to Scribe. It would normally act as a memory channel and would start spilling data to disk once it became full (my primary goal was to handle the case where the remote goes down, for example during HDFS maintenance). Would that be helpful in your case?

Jarcec
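
The spill-to-disk behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the design proposed in FLUME-1227 and not Flume code: the `SpillableBuffer` class, its methods, and the length-prefixed spill file format are all hypothetical choices made for this example.

```python
import collections
import os
import struct
import tempfile


class SpillableBuffer:
    """Hypothetical FIFO buffer: in memory until full, then spills to a file."""

    def __init__(self, memory_capacity, spill_path):
        self.memory_capacity = memory_capacity
        self.spill_path = spill_path
        self.memory = collections.deque()
        self.spilled = 0       # events on disk that have not been read back
        self.read_offset = 0   # byte offset of the next unread spilled event

    def put(self, event: bytes) -> None:
        # Use memory only while nothing is on disk, so FIFO order is kept.
        if self.spilled == 0 and len(self.memory) < self.memory_capacity:
            self.memory.append(event)
            return
        # Memory is full (or older events are already on disk): append a
        # length-prefixed record to the spill file.
        with open(self.spill_path, "ab") as f:
            f.write(struct.pack(">I", len(event)) + event)
        self.spilled += 1

    def take(self):
        # Events in memory are always older than events on disk.
        if self.memory:
            return self.memory.popleft()
        if self.spilled == 0:
            return None
        with open(self.spill_path, "rb") as f:
            f.seek(self.read_offset)
            (length,) = struct.unpack(">I", f.read(4))
            event = f.read(length)
            self.read_offset = f.tell()
        self.spilled -= 1
        if self.spilled == 0:
            # Disk is drained: drop the file and go back to memory-only mode.
            os.remove(self.spill_path)
            self.read_offset = 0
        return event


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        buf = SpillableBuffer(2, os.path.join(tmp, "spill.bin"))
        for i in range(5):
            buf.put(f"event-{i}".encode())
        in_memory, on_disk = len(buf.memory), buf.spilled
        drained = [buf.take() for _ in range(5)]
    return in_memory, on_disk, drained


if __name__ == "__main__":
    print(demo())
```

In this sketch, once any event has been spilled, new events keep going to disk until the spill file is fully drained. That preserves ordering, at the cost of slower puts while the backlog is being worked off.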

On Wed, Jul 04, 2012 at 04:07:48PM +0900, Juhani Connolly wrote:
> Evaluating Flume on some of our servers, the file channel seems very
> slow, likely because, like most typical web servers, ours have a
> single RAIDed disk available for writing to.
>
> Quoted below is a suggestion from a previous issue where our poor
> throughput came up, where it turns out that on multiple disks, file
> channel performance is great.
>
> On 06/27/2012 11:01 AM, Mike Percy wrote:
> >We are able to push > 8000 events/sec (2KB per event) through a single file channel if you put the checkpoint on one disk and use 2 other disks for data dirs. Not sure what the limit is. This is using the latest trunk code. Another limitation may be that you need to add additional sinks to your channel to drain it faster, because sinks are single-threaded and sources are multithreaded.
> >
> >Mike
>
> For the case where the disks happen to be available on the server,
> that's fantastic, but I suspect that most use cases are going to be
> similar to ours, where multiple disks are not available. Our use
> case isn't unusual as it's primarily aggregating logs from various
> services.
>
> We originally ran our log servers with an exec(tail)->file->avro
> setup, where throughput was very bad (80 MB in an hour). We then
> switched to a memory channel, which was fine (the peak-time 500 MB
> worth of hourly logs went through). Afterwards we switched back to
> the file channel, but with 5 identical avro sinks. This did not
> improve throughput (still 80 MB). RecoverableMemoryChannel showed
> very similar characteristics.
>
> I presume this is due to the writes going to two separate places,
> and being further compounded by also writing out and tailing the
> normal web logs: checking top and iostat, we could confirm we have
> significant iowait time, far more than we have during typical
> operation.
>
> As it is, we seem to be more or less guaranteeing no loss of logs
> with the file channel. Perhaps we could look into batching
> puts/takes for those that do not need 100% data retention but want
> more reliability than with the MemoryChannel which can potentially
> lose the entire capacity on a restart? Another possibility is
> writing an implementation that writes primarily sequentially. I've
> been meaning to take a deeper look at the implementation itself to
> give more informed commentary, but unfortunately I don't have the
> cycles right now. Hopefully someone with a better understanding of
> the current implementation (along with its interaction with the OS
> file cache) can comment on this.
>
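
The batching idea raised in the quoted message (trading a bounded amount of possible loss for fewer forced disk writes) can be illustrated with a small sketch. It assumes the cost is dominated by forcing data to disk, which the thread does not confirm, and it is not how Flume's file channel is implemented. The `write_events` function and its `sync_every` parameter are hypothetical names for this example.

```python
import os
import tempfile


def write_events(path, events, sync_every):
    """Append events to `path`, forcing them to disk once per `sync_every`
    events. Returns the number of fsync calls made."""
    syncs = 0
    with open(path, "ab") as f:
        for count, event in enumerate(events, start=1):
            f.write(event)
            if count % sync_every == 0:
                f.flush()
                os.fsync(f.fileno())
                syncs += 1
        if len(events) % sync_every != 0:
            # Force out the final partial batch.
            f.flush()
            os.fsync(f.fileno())
            syncs += 1
    return syncs


def demo():
    # 2 KB events, matching the event size mentioned in the quoted benchmark.
    events = [b"x" * 2048 for _ in range(200)]
    with tempfile.TemporaryDirectory() as tmp:
        per_event_path = os.path.join(tmp, "per_event.log")
        batched_path = os.path.join(tmp, "batched.log")
        per_event = write_events(per_event_path, events, sync_every=1)
        batched = write_events(batched_path, events, sync_every=50)
        same_size = os.path.getsize(per_event_path) == os.path.getsize(batched_path)
    return per_event, batched, same_size


if __name__ == "__main__":
    print(demo())
```

Both runs write the same bytes; only the number of forced syncs differs. The trade-off in this sketch is that up to `sync_every - 1` written events may not yet be durable if the process or machine dies between syncs.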
Juhani Connolly 2012-07-04, 09:13
Jarek Jarcec Cecho 2012-07-04, 09:36
Juhani Connolly 2012-07-04, 10:33
Arvind Prabhakar 2012-07-09, 05:42
Juhani Connolly 2012-07-09, 06:14
Arvind Prabhakar 2012-07-10, 09:36
Juhani Connolly 2012-07-11, 02:01