Flume >> mail # user >> File channel performance on a single disk is poor


Re: File channel performance on a single disk is poor
Hi, thanks for clarifying.

On 07/10/2012 06:36 PM, Arvind Prabhakar wrote:
> Hi,
>
> On Sun, Jul 8, 2012 at 11:14 PM, Juhani Connolly
> <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
>     Another matter that I'm curious of is whether or not we actually
>     need separate files for the data and checkpoints...
>
>
> The data file and checkpoint files serve different purposes. Checkpoint
> resides in memory and simulates the channel. The only difference is
> that it does not store the data in the queue itself, but pointers to
> data that resides in the log files. As a result the memory footprint
> of the checkpoint is very small regardless of how big each event
> payload is. This size only depends upon the capacity of the channel
> and nothing else.
This is more or less what I expected. Am I correct in believing that
each commit has to seek back and forth between two different files?
This would make all access on a single disk non-sequential.
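
To make sure I have the checkpoint description above right, here is a
minimal sketch of how I picture it. This is not the FileChannel code;
EventPointer, PointerQueue and append_event are hypothetical names, and
the bytearray just stands in for a log file on disk.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class EventPointer:
    """Hypothetical pointer to an event stored in a log file."""
    file_id: int
    offset: int


class PointerQueue:
    """In-memory queue that holds pointers, never the payloads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, pointer):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(pointer)

    def take(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


def append_event(log, file_id, payload):
    """Append payload to the stand-in log and return a pointer to it."""
    offset = len(log)
    log.extend(payload)
    return EventPointer(file_id, offset)
```

A pointer is the same size whether the payload is one byte or a
thousand, so the queue's footprint tracks the channel capacity rather
than the event size.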

>     Can we not add a magic header before each type of entry to
>     differentiate, and thus guarantee significantly more sequential
>     access?
>
>
> In the general case access will be sequential. In the best case, the
> channel will have moved the writes to new log files and continue to do
> reads from old (rolled) files which reduces seek contention. From what
> I know, I don't think it will be trivial to effect your suggested
> change without significantly impacting the entire logic of the channel.

I don't understand how that reduces seek contention if the files are
all on the same disk. I don't think the reads are that painful; a lot
of that is hopefully taken care of by the OS cache...
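
As a toy illustration of what I mean (a deliberately crude model, not a
measurement; the file names are made up), count how often consecutive
accesses land on a different file. On a single disk each such switch
can cost a head seek unless the OS cache absorbs it.

```python
def count_file_switches(accesses):
    """Count how often consecutive accesses target a different file."""
    switches = 0
    previous = None
    for file_name in accesses:
        if previous is not None and file_name != previous:
            switches += 1
        previous = file_name
    return switches


# Writes go to the new log while reads come from the rolled one.
interleaved = ["log-2", "log-1"] * 4
# The same eight accesses, grouped by file.
grouped = ["log-2"] * 4 + ["log-1"] * 4
```

In this model the interleaved pattern switches files on every access
while the grouped one switches once, regardless of whether the two
files are "old" and "new".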

Implementation would likely be difficult, yes. I've only skimmed the
code and haven't attempted the change for that reason. As you suggest,
it might be better to have a separate implementation.
>
>     What is killing performance on a single disk right now is the
>     constant seeks. The problem with this though would be putting
>     together a file format that allows quick seeking through to the
>     correct position, and rolling would be a lot harder. I think this
>     is a lot more difficult and might be more of a long term target.
>
>
> Perhaps what you are describing is a different type of persistent
> channel that is optimized for high latency IO systems. I would
> encourage you to take your idea one step further and see if that can
> be drafted as yet another channel that serves this particular use-case.
>

I'd like to do this, though it seems quite involved. Hopefully I can get
some time to figure it out later down the road. Jarcec's spillable
channel should also help on this front.

For the time being, I've resolved the issue for us with a workaround:
limiting the number of commits (by making ExecSource commit multiple
entries at a time).
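
Roughly, the idea looks like the sketch below. It is not the actual
ExecSource change; CountingChannel, deliver and batch_size are
hypothetical names, and a "commit" here only bumps a counter where a
real file channel would have to sync to disk.

```python
class CountingChannel:
    """Stand-in channel that counts commits instead of syncing to disk."""

    def __init__(self):
        self.events = []
        self.commits = 0

    def commit(self, batch):
        self.events.extend(batch)
        self.commits += 1


def deliver(lines, channel, batch_size):
    """Commit lines to the channel in groups of up to batch_size."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            channel.commit(batch)
            batch = []
    if batch:
        # Flush the final, possibly partial batch.
        channel.commit(batch)
```

With 1000 lines, a batch size of 1 gives 1000 commits and a batch size
of 20 gives 50, with the same events delivered in the same order.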

My concern is that FileChannel is represented by a number of people as
having good performance, when at present that holds only in one of two
cases: multiple disks, or batched transactions.

Thanks,
  Juhani Connolly

> Regards,
> Arvind Prabhakar
>
>
>
>     Juhani
>
>
>>     Regards,
>>     Arvind Prabhakar
>>
>>
>>     On Wed, Jul 4, 2012 at 3:33 AM, Juhani Connolly
>>     <[EMAIL PROTECTED]
>>     <mailto:[EMAIL PROTECTED]>> wrote:
>>
>>         It looks good to me as it provides a nice balance between
>>         reliability and throughput.
>>
>>         It's certainly one possible solution to the issue, though I
>>         do believe that the current one could be made more friendly
>>         towards single disk access (e.g. batching writes to the disk
>>         may well be doable and would be curious what someone with
>>         more familiarity with the implementation thinks).
>>
>>
>>         On 07/04/2012 06:36 PM, Jarek Jarcec Cecho wrote:
>>
>>             We had connected discussion about this "SpillableChannel"