Re: File channel configuration
Brock Noland 2012-10-31, 14:33
I would be careful about using SSDs with FileChannel. People testing
ZooKeeper on SSDs have found the following:
"SSDs have some really terrible corner cases for latency. I've seen
them take 40+ seconds (that's not a mistake - seconds) for fsync to
which would affect FileChannel as well.
On Tue, Oct 30, 2012 at 11:11 PM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:
> Awesome, thanks for the quick reply. Your answer to my last question makes sense; I figured that is how it worked but wanted to double-check that I hadn't misunderstood anything.
> On 2012-10-30, at 8:24 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:
>> Hi Cameron,
>> Answers inline.
>> On Tue, Oct 30, 2012 at 8:08 PM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:
>>> I'm trying to figure out the best way to configure the file channel for
>>> maximum throughput and have a couple of questions.
>>> 1. What is the best hard disk layout? An ssd for the checkpoint directory
>>> and a separate disk for each file channel on the agent?
>> The checkpoint need not be on SSD. In fact, I'd put the data dirs on
>> SSD. That said, I have not tested either with SSD, neither do I know
>> anyone who has.
>> If you have only as many disks as channels, then one disk per channel
>> seems fine. If you can afford more disks per channel, put each data
>> dir on a different disk for better performance. The channel
>> round-robins between disks (though each txn will go to the same data
>>> 2. Can multiple disks be utilized for a single channel? I could only seem to
>>> configure a single data directory.
>> Yes, multiple disks can be used. In the dataDirs config, pass in a
>> comma-separated list of data directories.
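For illustration, a minimal sketch of a file channel spread over several disks. The agent name (agent1), channel name (ch1), and all paths are hypothetical placeholders, not taken from this thread; only the property names come from the Flume file channel.

```properties
# Hypothetical agent/channel names and mount points; adjust to your layout.
agent1.channels = ch1
agent1.channels.ch1.type = file

# Checkpoint directory; per the reply above it does not need to be on SSD.
agent1.channels.ch1.checkpointDir = /disk1/flume/ch1/checkpoint

# Comma-separated list of data directories, ideally one per physical disk.
agent1.channels.ch1.dataDirs = /disk2/flume/ch1/data,/disk3/flume/ch1/data
```

Keeping each data directory on its own physical disk is the point of the layout; several directories on the same disk would not be expected to help.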
>>> 3. There is a comment in the documentation that mentions adding more sinks
>>> to drain the channel faster. If my final agent sink was hdfs does that mean
>>> configuring two hdfs sinks using a sink group to drain a single channel on
>>> an agent? I noticed you can configure thread pools on the hdfs sink but
>>> haven't looked into it.
>> No, in a sink group only one sink is active at any point in time. Sink
>> groups are meant for load balancing or fail over between the sinks in
>> the group. Either use multiple sinks or multiple sink groups.
>> Adding multiple sinks helps often since you have more threads doing
>> I/O and more sink runners reading from the channel.
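As a rough sketch of "multiple sinks" without a sink group: two HDFS sinks that both read from the same channel, each with its own sink runner. The names agent1, ch1, hdfs1, hdfs2, the namenode URL, and the file prefixes are assumptions made up for this example.

```properties
# Hypothetical names; both sinks drain the same channel (ch1).
agent1.sinks = hdfs1 hdfs2

agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode.example.com/flume/events
# Distinct prefixes so the two sinks do not write to the same file name.
agent1.sinks.hdfs1.hdfs.filePrefix = events-1

agent1.sinks.hdfs2.type = hdfs
agent1.sinks.hdfs2.channel = ch1
agent1.sinks.hdfs2.hdfs.path = hdfs://namenode.example.com/flume/events
agent1.sinks.hdfs2.hdfs.filePrefix = events-2
```

No sinkgroups entry is defined here, so neither sink is held back by a sink processor.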
>>> 4. Does it make sense to have my agent run two channels, both with sinks
>>> writing to a single HDFS cluster, each configured with a separate data disk,
>>> and have the previous agent round-robin deliver to it?
>> I am not sure why you need 2 channels (remember, a source will
>> replicate the events to each channel). Multiple channels are used to
>> bifurcate the flow etc. You can use a single channel, with multiple
>> disks, and multiple sinks to write to HDFS. You might want to play
>> with batch sizes and timeouts on HDFS Sink side - this often affects
>> performance a lot. Perhaps I didn't get this question. If I didn't,
>> could you clarify?
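A sketch of the HDFS sink knobs that "batch sizes and timeouts" most likely refers to. The names agent1, ch1, and hdfs1 are hypothetical, and the values are starting points to experiment with, not recommendations from this thread.

```properties
# Hypothetical names and illustrative values only.

# Events written per transaction before flushing to HDFS.
agent1.sinks.hdfs1.hdfs.batchSize = 1000

# Milliseconds allowed for HDFS operations (open, write, flush, close).
agent1.sinks.hdfs1.hdfs.callTimeout = 10000

# Roll files on time only: every 300 seconds, with size- and count-based
# rolling disabled (0).
agent1.sinks.hdfs1.hdfs.rollInterval = 300
agent1.sinks.hdfs1.hdfs.rollSize = 0
agent1.sinks.hdfs1.hdfs.rollCount = 0

# The channel's transaction capacity must be at least the sink batch size.
agent1.channels.ch1.transactionCapacity = 1000
```

Larger batches usually mean fewer, bigger transactions against the channel and HDFS, at the cost of more events in flight per transaction.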
>>> Thanks for any input anyone has
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/