Re: File channel performance on a single disk is poor
Arvind Prabhakar 2012-07-09, 05:42
Hi,

> It's certainly one possible solution to the issue, though I do
> believe that the current one could be made more friendly
> towards single-disk access (e.g. batching writes to the disk
> may well be doable, and I would be curious what someone
> with more familiarity with the implementation thinks).

The implementation of the file channel is that of a write-ahead log, in
that it serializes all the actions as they happen. Using these actions, it
can reconstruct the state of the channel at any time. It supports two
mutually exclusive transaction types: a transaction consisting of puts, and
one consisting of takes. It may be possible to use the heap to batch the
puts and takes and serialize them to disk when the commit occurs.
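
A minimal sketch of that idea (hypothetical names, not the actual
FileChannel internals): hold a transaction's events in the heap and write
them to the log in a single batch when the commit occurs.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer a transaction's events in the heap and
// serialize them to the write-ahead log only on commit, turning many
// small disk writes into one larger sequential write plus one fsync.
// (java.nio.channels.FileChannel is fully qualified to avoid confusion
// with Flume's own FileChannel class.)
class BatchingTransaction {
    private final java.nio.channels.FileChannel log; // the WAL file
    private final List<byte[]> pending = new ArrayList<>();

    BatchingTransaction(java.nio.channels.FileChannel log) {
        this.log = log;
    }

    void put(byte[] event) {
        pending.add(event); // heap only, no disk I/O yet
    }

    void commit() throws IOException {
        int size = 0;
        for (byte[] e : pending) size += 4 + e.length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] e : pending) {
            buf.putInt(e.length); // length-prefixed records
            buf.put(e);
        }
        buf.flip();
        while (buf.hasRemaining()) log.write(buf);
        log.force(false); // one fsync per transaction
        pending.clear();
    }
}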

This approach would minimize the number of disk operations and change the
performance characteristics of the channel. Although it would probably
improve performance, it is hard to tell for sure without testing it under
load in different scenarios.

Regards,
Arvind Prabhakar
On Wed, Jul 4, 2012 at 3:33 AM, Juhani Connolly <
[EMAIL PROTECTED]> wrote:

> It looks good to me as it provides a nice balance between reliability and
> throughput.
>
> It's certainly one possible solution to the issue, though I do believe
> that the current one could be made more friendly towards single-disk
> access (e.g. batching writes to the disk may well be doable, and I would
> be curious what someone with more familiarity with the implementation thinks).
>
>
> On 07/04/2012 06:36 PM, Jarek Jarcec Cecho wrote:
>
>> We had a connected discussion about this "SpillableChannel" (working name)
>> on FLUME-1045, and I believe the consensus is that we will create something
>> like that. In fact, I'm planning to do it myself in the near future - I just
>> need to prioritize my todo list first.
>>
>> Jarcec
>>
>> On Wed, Jul 04, 2012 at 06:13:43PM +0900, Juhani Connolly wrote:
>>
>>> Yes... I was actually poking around for that issue as I remembered
>>> seeing it before. I had previously also suggested a compound channel
>>> that would have worked like the buffer store in Scribe, but the general
>>> opinion was that it allowed too many mixed configurations, which could
>>> make testing and verifying correctness difficult.
>>>
>>> On 07/04/2012 04:33 PM, Jarek Jarcec Cecho wrote:
>>>
>>>> Hi Juhani,
>>>> a while ago I filed JIRA FLUME-1227, where I suggested creating some
>>>> sort of SpillableChannel that would behave similarly to Scribe. It would
>>>> normally act as a memory channel and would start spilling data to disk
>>>> if it got full (my primary goal here was to solve the issue of the
>>>> remote end going down, for example during HDFS maintenance). Would that
>>>> be helpful for your case?
>>>>
>>>> Jarcec
>>>>
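
A rough sketch of the SpillableChannel idea described above (hypothetical
names and simplified locking, not the eventual FLUME-1227 design): events
live in a bounded in-memory queue and overflow to a disk-backed store only
when the queue fills up.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a channel that acts as a memory channel until its
// queue is full, then spills further events to a disk-backed store.
class SpillableChannelSketch {
    private final Deque<byte[]> memory = new ArrayDeque<>();
    private final int capacity;
    private final DiskStore disk; // assumed disk-backed overflow store

    SpillableChannelSketch(int capacity, DiskStore disk) {
        this.capacity = capacity;
        this.disk = disk;
    }

    synchronized void put(byte[] event) {
        if (memory.size() < capacity) {
            memory.addLast(event); // fast path: memory only
        } else {
            disk.append(event);    // slow path: spill to disk
        }
    }

    synchronized byte[] take() {
        byte[] event = memory.pollFirst(); // drain memory first
        return (event != null) ? event : disk.poll();
    }

    interface DiskStore {
        void append(byte[] event);
        byte[] poll(); // returns null when empty
    }
}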
>>>> On Wed, Jul 04, 2012 at 04:07:48PM +0900, Juhani Connolly wrote:
>>>>
>>>>> Evaluating Flume on some of our servers, the file channel seems very
>>>>> slow, likely because, like most typical web servers, ours have a
>>>>> single RAIDed disk available for writing to.
>>>>>
>>>>> Quoted below is a suggestion from a previous issue where our poor
>>>>> throughput came up; it turns out that with multiple disks, file
>>>>> channel performance is great.
>>>>>
>>>>> On 06/27/2012 11:01 AM, Mike Percy wrote:
>>>>>
>>>>>> We are able to push > 8000 events/sec (2KB per event) through a
>>>>>> single file channel if you put the checkpoint on one disk and use 2
>>>>>> other disks for data dirs. Not sure what the limit is. This is using
>>>>>> the latest trunk code. Another consideration is that you may need to
>>>>>> add additional sinks to your channel to drain it faster, because sinks
>>>>>> are single-threaded while sources are multithreaded.
>>>>>>
>>>>>> Mike
>>>>>>
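
For reference, an agent configuration along the lines Mike describes might
look like the following (the agent and component names are made up, and the
paths and sink settings are illustrative only):

# Hypothetical config: checkpoint on one disk, data dirs on two others, and
# two sinks draining the same channel, since sinks are single-threaded.
# (Sources, hdfs.path, and other required settings are omitted for brevity.)
agent.channels = fc
agent.sinks = s1 s2

agent.channels.fc.type = file
agent.channels.fc.checkpointDir = /disk1/flume/checkpoint
agent.channels.fc.dataDirs = /disk2/flume/data,/disk3/flume/data

agent.sinks.s1.type = hdfs
agent.sinks.s1.channel = fc
agent.sinks.s2.type = hdfs
agent.sinks.s2.channel = fc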
>>>>> For the case where the disks happen to be available on the server,
>>>>> that's fantastic, but I suspect that most use cases are going to be
>>>>> similar to ours, where multiple disks are not available. Our use