Flume, mail # user - File channel performance on a single disk is poor


Re: File channel performance on a single disk is poor
Juhani Connolly 2012-07-09, 06:14
Hi, thanks for your input.

On 07/09/2012 02:42 PM, Arvind Prabhakar wrote:
> Hi,
>
> > It's certainly one possible solution to the issue, though I do
> > believe that the current one could be made more friendly
> > towards single disk access(e.g. batching writes to the disk
> > may well be doable and would be curious what someone
> > with more familiarity with the implementation thinks).
>
> The implementation of the file channel is that of a write ahead log,
> in that it serializes all the actions as they happen. Using these
> actions, it can reconstruct the state of the channel at anytime. There
> are two mutually exclusive transaction types it supports - a
> transaction consisting of puts, and one consisting of takes. It may be
> possible to use the heap to batch the puts and takes and serialize
> them to disk when the commit occurs.
>
> This approach will minimize the number of disk operations and will
> have an impact on the performance characteristics of the channel.
> Although it probably will improve performance, it is hard to tell for
> sure unless we test it out under load in different scenarios.
>

This does sound a lot better to me. I'm not sure there is much demand
for restoring the state of an uncommitted set of puts/takes to a file
channel after restarting an agent. If the transaction wasn't completed,
its state is not really going to matter after a restart.
I'm really not familiar with WAL implementations, but is it not enough
to write the data to be committed before writing the commit marker and
reporting success? I don't think it is necessary to write each piece as
it comes in, so long as it is all written before reporting
success/failure.
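
To make the idea concrete, here is a minimal sketch of batching puts on
the heap and writing them, followed by a commit marker, in one sequential
write at commit time. It is not the Flume file channel implementation:
the BatchedLog class, the record layout (1-byte type tag plus 4-byte
length), and the replay function are all hypothetical names made up for
illustration.

```python
import os
import struct

# Hypothetical record type tags, not Flume's on-disk format.
PUT = 1
COMMIT = 2
HEADER = struct.Struct(">BI")  # 1-byte type tag, 4-byte big-endian length/count


class BatchedLog:
    """Append-only log that buffers puts in memory until commit."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self._pending = []

    def put(self, event: bytes):
        # Nothing touches the disk here; the event only lives on the heap.
        self._pending.append(event)

    def commit(self):
        # Serialize every buffered put, then the commit marker, and hand
        # the whole batch to the disk in one write followed by one fsync.
        buf = bytearray()
        for event in self._pending:
            buf += HEADER.pack(PUT, len(event)) + event
        buf += HEADER.pack(COMMIT, len(self._pending))
        self._file.write(buf)
        self._file.flush()
        os.fsync(self._file.fileno())
        self._pending.clear()

    def rollback(self):
        # An aborted transaction never reached the disk, so just drop it.
        self._pending.clear()

    def close(self):
        self._file.close()


def replay(path):
    """Return the events of committed transactions only."""
    with open(path, "rb") as f:
        data = f.read()
    committed, current = [], []
    pos = 0
    while pos + HEADER.size <= len(data):
        kind, n = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        if kind == PUT:
            if pos + n > len(data):
                break  # torn write at the tail: ignore the partial record
            current.append(data[pos:pos + n])
            pos += n
        elif kind == COMMIT:
            committed.extend(current)
            current = []
        else:
            break  # unknown tag: stop rather than guess
    return committed
```

Puts that were buffered but never committed are simply absent from the
file, so after a restart replay() only sees complete transactions. The
cost is that an uncommitted transaction is lost on a crash, which is the
trade-off discussed above.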

Another matter that I'm curious about is whether we actually need
separate files for the data and checkpoints. Could we not add a magic
header before each type of entry to differentiate them, and thus guarantee
significantly more sequential access? What is killing performance on a
single disk right now is the constant seeking. The problem with this,
though, would be putting together a file format that allows quick seeking
to the correct position, and rolling would be a lot harder. I think this
is a lot more difficult and might be more of a long-term target.
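
As a rough sketch of the single-file idea, the following interleaves data
and checkpoint entries in one append-only stream, each prefixed with a
magic header and a length. The magic values, the entry layout, and the
function names are assumptions made up for this example; none of them
come from the Flume code.

```python
import io
import struct

# Hypothetical 4-byte magic headers distinguishing the two entry types.
MAGIC_DATA = b"DATA"
MAGIC_CKPT = b"CKPT"
LENGTH = struct.Struct(">I")  # 4-byte big-endian payload length


def append_entry(f, magic, payload):
    """Append one entry (magic, length, payload) at the end of the file."""
    f.seek(0, io.SEEK_END)
    f.write(magic + LENGTH.pack(len(payload)) + payload)


def scan(f):
    """Yield (offset, magic, payload) for every complete entry, in order."""
    f.seek(0)
    while True:
        offset = f.tell()
        header = f.read(4 + LENGTH.size)
        if len(header) < 4 + LENGTH.size:
            return  # end of file or truncated header
        magic = header[:4]
        (length,) = LENGTH.unpack(header[4:])
        payload = f.read(length)
        if len(payload) < length:
            return  # truncated payload at the tail
        yield offset, magic, payload


def last_checkpoint(f):
    """Return (offset, payload) of the newest checkpoint, or None."""
    latest = None
    for offset, magic, payload in scan(f):
        if magic == MAGIC_CKPT:
            latest = (offset, payload)
    return latest
```

All writes are appends, so a single disk never has to seek between a data
file and a checkpoint file. The sketch also shows the difficulty raised
above: finding the newest checkpoint takes a scan of the whole file
unless the format carries some kind of index or back-pointer, and rolling
files would have to take the position of the last checkpoint into
account.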

Juhani

> Regards,
> Arvind Prabhakar
>
>
> On Wed, Jul 4, 2012 at 3:33 AM, Juhani Connolly
> <[EMAIL PROTECTED]> wrote:
>
>     It looks good to me as it provides a nice balance between
>     reliability and throughput.
>
>     It's certainly one possible solution to the issue, though I do
>     believe that the current one could be made more friendly towards
>     single disk access(e.g. batching writes to the disk may well be
>     doable and would be curious what someone with more familiarity
>     with the implementation thinks).
>
>
>     On 07/04/2012 06:36 PM, Jarek Jarcec Cecho wrote:
>
>         We had connected discussion about this "SpillableChannel"
>         (working name) on FLUME-1045 and I believe that consensus is
>         that we will create something like that. In fact, I'm planning
>         to do it myself in near future - I just need to prioritize my
>         todo list first.
>
>         Jarcec
>
>         On Wed, Jul 04, 2012 at 06:13:43PM +0900, Juhani Connolly wrote:
>
>             Yes... I was actually poking around for that issue as I
>             remembered seeing it before. I had also previously suggested
>             a compound channel that would have worked like the buffer
>             store in scribe, but the general opinion was that it allowed
>             too many mixed configurations that could make testing and
>             verifying correctness difficult.
>
>             On 07/04/2012 04:33 PM, Jarek Jarcec Cecho wrote:
>
>                 Hi Juhally,
>                 a while ago I filed JIRA FLUME-1227, where I
>                 suggested creating some sort of SpillableChannel that