Flume, mail # user - File channel configuration


Re: File channel configuration
Hari Shreedharan 2012-10-31, 03:24
Hi Cameron,

Answers inline.

Hari

On Tue, Oct 30, 2012 at 8:08 PM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:
> Hi
>
> I'm trying to figure out the best way to configure the file channel for
> maximum throughput and have a couple of questions.
>
> 1. What is the best hard disk layout? An SSD for the checkpoint directory
> and a separate disk for each file channel on the agent?

The checkpoint need not be on SSD. In fact, I'd put the data dirs on
SSD. That said, I have not tested either with SSD, nor do I know
anyone who has.

If you have only as many disks as channels, then one disk per channel
seems fine. If you can afford more disks per channel, put each data
dir on a different disk for better performance. The channel
round-robins between disks (though each txn will go to the same data
dir).
>
> 2. Can multiple disks be utilized for a single channel? I could only seem to
> configure a single data directory.

Yes, multiple disks can be used. In the dataDirs config, pass in a
comma-separated list of data directories.
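
As a rough sketch of what that could look like in the agent's
properties file: the agent name "agent", the channel name "fc1", and
the /mnt/... paths are made up for illustration. checkpointDir and
dataDirs are the file channel property names as I understand them from
the user guide.

```properties
# Hypothetical agent "agent" with a single file channel "fc1".
agent.channels = fc1
agent.channels.fc1.type = file

# Checkpoint directory on its own disk (path is an assumption).
agent.channels.fc1.checkpointDir = /mnt/disk1/flume/checkpoint

# One data directory per physical disk, comma-separated.
# The mount points are placeholders for whatever disks are available.
agent.channels.fc1.dataDirs = /mnt/disk2/flume/data,/mnt/disk3/flume/data
```

Placing each data directory on a separate physical disk is what lets
the round-robin between directories spread the I/O.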
>
> 3. There is a comment in the documentation that mentions adding more sinks
> to drain the channel faster. If my final agent sink was hdfs does that mean
> configuring two hdfs sinks using a sink group to drain a single channel on
> an agent? I noticed you can configure thread pools on the hdfs sink but
> haven't looked into it.

No, in a sink group only one sink is active at any point in time. Sink
groups are meant for load balancing or failover between the sinks in
the group. Either use multiple sinks or multiple sink groups.

Adding multiple sinks often helps, since you have more threads doing
I/O and more sink runners reading from the channel.
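
A minimal sketch of two independent HDFS sinks draining one channel,
with no sink group involved. The names "agent", "fc1", "hdfs1" and
"hdfs2", and the namenode URL, are hypothetical. Giving each sink its
own hdfs.filePrefix is my assumption, to keep the two sinks from
writing files with the same name into one directory.

```properties
# Two sinks, each with its own sink runner, both reading from "fc1".
# No agent.sinkgroups entry is defined, so both sinks stay active.
agent.sinks = hdfs1 hdfs2

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = fc1
agent.sinks.hdfs1.hdfs.path = hdfs://namenode.example.com/flume/events
# Distinct prefix per sink (assumption: avoids file name clashes).
agent.sinks.hdfs1.hdfs.filePrefix = events-1

agent.sinks.hdfs2.type = hdfs
agent.sinks.hdfs2.channel = fc1
agent.sinks.hdfs2.hdfs.path = hdfs://namenode.example.com/flume/events
agent.sinks.hdfs2.hdfs.filePrefix = events-2
```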

>
> 4. Does it make sense to have my agent run two channels both with sinks
> writing to a single hdfs cluster each configured with a separate data disk
> and have the previous agent round robin deliver to it?

I am not sure why you need 2 channels (remember, by default a source
replicates its events to every channel it is attached to). Multiple
channels are used to split the flow into separate paths. You can use
a single channel, with multiple disks, and multiple sinks to write to
HDFS. You might want to play with batch sizes and timeouts on the
HDFS sink side; this often affects performance a lot. Perhaps I
misunderstood this question. If so, could you clarify?
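
For reference, these are the knobs I mean, sketched for a hypothetical
sink "hdfs1" on a hypothetical channel "fc1". The values are
placeholders to experiment with, not recommendations.

```properties
# Number of events the sink writes before flushing to HDFS.
agent.sinks.hdfs1.hdfs.batchSize = 1000

# Milliseconds allowed for HDFS operations such as open, write, flush.
agent.sinks.hdfs1.hdfs.callTimeout = 30000

# Maximum events per channel transaction. As far as I know the sink's
# batch size should not exceed this, so raise both together.
agent.channels.fc1.transactionCapacity = 1000
```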

>
> Thanks for any input anyone has