Flume >> mail # user >> File Channel  performance and fsync

Re: File Channel performance and fsync
I missed this initially because my filters put the mailing-list cc'd copy
in a different folder...

Anyway, you didn't post your first tier's conf; can you post that? Is
that also using a file channel? Regardless, what is important at the
second tier is that the batch size that arrives at your collector node
is *not* the batch size from your first tier's source. It is the batch
size designated at your first tier's avro-sink (the avro-sink decides how
many messages to pull from the channel and then dumps them to the source
on the next tier).

So if you haven't configured that, or it is set low, you will have poor
performance on the tier 2 file channel.
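
To illustrate why a small batch hurts, here is a minimal Python sketch of
the commit-per-batch pattern described further down this thread. It is not
Flume code: write_events and count_fsyncs are hypothetical helpers, and the
only assumption is that a channel issues one fsync per committed batch.

```python
import os
import tempfile


def write_events(path, events, batch_size):
    """Append events to path, issuing one fsync per batch.

    Returns the number of fsync calls made. This mimics the
    "write N events, then fsync" pattern, not Flume's real code.
    """
    fsyncs = 0
    with open(path, "ab") as f:
        for start in range(0, len(events), batch_size):
            for event in events[start:start + batch_size]:
                f.write(event)
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push it to disk
            fsyncs += 1
    return fsyncs


def count_fsyncs(num_events, batch_size):
    """Write num_events dummy events to a temp file and count fsyncs."""
    events = [b"event\n"] * num_events
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        return write_events(path, events, batch_size)
    finally:
        os.remove(path)


if __name__ == "__main__":
    for size in (1, 100):
        print(size, count_fsyncs(1000, size))
```

With 1000 events, a batch of 1 costs 1000 fsyncs while a batch of 100 costs
10, so whatever batch size reaches the tier 2 channel sets how often it has
to wait on the disk.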

We get many MB/s using file channel, though I haven't checked the
figures lately.

Is your Flume 1.2.0 the Cloudera release, or is it the raw one? I
vaguely remember something important (to us at least) missing from it, and
we just use a packaged version I maintain. 1.3.0 should be released soon
if you can wait for that; I don't see any major issues with it. Or you
could even just pull the current 1.3 head and compile that.

On 10/23/2012 03:40 PM, Jagadish Bihani wrote:
> Hi Brock
> I am using flume 1.2.0.
> About the batching: as per the user guide, "exec source" does have a batch
> option in 1.2.0 (param name: batchSize, default value: 20), and I
> have tried it. Apparently it works fine. And the file channel has a
> parameter "transactionCapacity", set to 1000 by default. Is that the
> batch size of the file channel?
> Anyway, even with increased batching I couldn't get past 110-150 KB/sec
> with File Channel.
> Could you please help me understand the questions I asked in the
> original mail of this thread about fsync lies? With the disk which
> "apparently does fsync lie" I get 3 MB/sec in 1 flow.
> I don't know whether it actually does "fsync lie", but there is a
> remarkable difference in fsync performance between 2 machines which
> have almost the same hardware.
> Regards
> Jagadish
> On 10/22/2012 07:59 PM, Brock Noland wrote:
>> In this case, it's best to think about FileChannel as if it were a
>> database. Let's pretend we are going to insert 1 million rows. If we
>> committed on each row, would performance be "good"?  No, everyone
>> knows that when you are inserting rows in databases, you want to
>> batch 100-1000 rows into a single commit, if you want "good"
>> performance. (Quoting good because it's subjective based on
>> the scenario, but in this case we mean lots of MB/second).
>> Part of the reason behind this logic is that when a database does a
>> commit, it does an fsync operation to ensure that all data is written
>> to disk and that you will not lose data due to a subsequent power loss.
>> FileChannel behaves *exactly* the same. If your "batch" is only a
>> single event, file channel will:
>> write single event
>> fsync
>> write single event
>> fsync
>> As such, if you want "good" performance with FileChannel, you must
>> increase your batch size, just like a database. If you have a
>> batchSize of say 100, then FileChannel will:
>> write single event 0
>> write single event 1
>> ...
>> write single event 99
>> fsync
>> This will result in much "better" performance. It's worth noting
>> that ExecSource in Flume 1.2 does not have a batchSize, and as such
>> each event is written and then committed. ExecSource in Flume 1.3,
>> which we will release soon, does have a configurable batchSize. If
>> you want to try that out you can build it from the flume-1.3.0 branch.
>> Brock
>> On Mon, Oct 22, 2012 at 8:59 AM, Brock Noland <[EMAIL PROTECTED]> wrote:
>>     Which version? 1.2 or trunk?
>>     On Monday, October 22, 2012 at 8:18 AM, Jagadish Bihani wrote:
>>>     Hi
>>>     This is the simple configuration with which I am getting
>>>     lower performance.
>>>     Even with a 2-tier architecture (cat source - avro sinks - avro
>>>     source - HDFS sink)
>>>     I get similar performance with file channel.