Flume >> mail # user >> File channel configuration


Cameron Gandevia 2012-10-31, 03:08
Hari Shreedharan 2012-10-31, 03:24
Re: File channel configuration
Hey

Awesome, thanks for the quick reply. Your answer to my last question makes sense; I figured that was how it worked but wanted to double-check that I hadn't misunderstood anything.

Thanks
On 2012-10-30, at 8:24 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:

> Hi Cameron,
>
> Answers inline.
>
> Hari
>
> On Tue, Oct 30, 2012 at 8:08 PM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:
>> Hi
>>
>> I'm trying to figure out the best way to configure the file channel for
>> maximum throughput and have a couple of questions.
>>
>> 1. What is the best hard disk layout? An ssd for the checkpoint directory
>> and a separate disk for each file channel on the agent?
>
> The checkpoint need not be on SSD. In fact, I'd put the data dirs on
> SSD. That said, I have not tested either with SSD, nor do I know
> anyone who has.
>
> If you have only as many disks as channels, then one disk per channel
> seems fine. If you can afford more disks per channel, put each data
> dir on a different disk for better performance. The channel
> round-robins between disks (though each txn will go to the same data
> dir).
>>
>> 2. Can multiple disks be utilized for a single channel? I could only seem to
>> configure a single data directory.
>
> Yes, multiple disks can be used. In the dataDirs config, pass in a
> comma-separated list of data directories.
>>
>> 3. There is a comment in the documentation that mentions adding more sinks
>> to drain the channel faster. If my final agent sink was hdfs does that mean
>> configuring two hdfs sinks using a sink group to drain a single channel on
>> an agent? I noticed you can configure thread pools on the hdfs sink but
>> haven't looked into it.
>
> No, in a sink group only one sink is active at any point in time. Sink
> groups are meant for load balancing or failover between the sinks in
> the group. Either use multiple sinks or multiple sink groups.
>
> Adding multiple sinks often helps, since you have more threads doing
> I/O and more sink runners reading from the channel.
>
>>
>> 4. Does it make sense to have my agent run two channels, both with sinks
>> writing to a single hdfs cluster, each configured with a separate data disk,
>> and have the previous agent round robin deliver to it?
>
> I am not sure why you need 2 channels (remember, by default a source
> replicates the events to each channel). Multiple channels are used to
> bifurcate the flow etc. You can use a single channel, with multiple
> disks, and multiple sinks to write to HDFS. You might want to play
> with batch sizes and timeouts on the HDFS sink side, as this often affects
> performance a lot. Perhaps I didn't get this question. If I didn't,
> could you clarify?
>
>>
>> Thanks for any input anyone has
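
For reference, the layout suggested in the quoted answers (one file channel spread over several disks, drained by more than one HDFS sink) might look roughly like the sketch below. It is not a tested or tuned configuration. The agent name a1, the component names, the port, the directory paths, the namenode URL and all numeric values are placeholders made up for illustration. The distinct file prefixes are an assumption on my part, meant to keep the two sinks from writing files with the same name.

```properties
# Hypothetical agent "a1": all names, paths and values are illustrative.
a1.sources = src1
a1.channels = ch1
a1.sinks = hdfs1 hdfs2

# Source receiving events from the previous agent (port is a placeholder).
a1.sources.src1.type = avro
a1.sources.src1.bind = 0.0.0.0
a1.sources.src1.port = 4141
a1.sources.src1.channels = ch1

# One file channel. dataDirs takes a comma-separated list, so each data
# directory can live on its own disk; the checkpoint sits on another disk.
a1.channels.ch1.type = file
a1.channels.ch1.checkpointDir = /disk1/flume/checkpoint
a1.channels.ch1.dataDirs = /disk2/flume/data,/disk3/flume/data

# Two independent HDFS sinks (no sink group) draining the same channel.
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = ch1
a1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.hdfs1.hdfs.filePrefix = events-1
# Batch size and timeout are the knobs mentioned above; values are guesses.
a1.sinks.hdfs1.hdfs.batchSize = 1000
a1.sinks.hdfs1.hdfs.callTimeout = 30000

a1.sinks.hdfs2.type = hdfs
a1.sinks.hdfs2.channel = ch1
a1.sinks.hdfs2.hdfs.path = hdfs://namenode/flume/events
a1.sinks.hdfs2.hdfs.filePrefix = events-2
a1.sinks.hdfs2.hdfs.batchSize = 1000
a1.sinks.hdfs2.hdfs.callTimeout = 30000
```

Because neither sink belongs to a sink group, each gets its own sink runner thread and both pull from ch1 concurrently, which is the effect described in the answer to question 3.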
Brock Noland 2012-10-31, 14:33