Re: File channel configuration
Cameron Gandevia 2012-10-31, 04:11
Hey
Awesome, thanks for the quick reply. Your answer to my last question makes sense. I figured that is how it worked; I just wanted to double-check that I didn't misunderstand anything.

Thanks

On 2012-10-30, at 8:24 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:

> Hi Cameron,
>
> Answers inline.
>
> Hari
>
> On Tue, Oct 30, 2012 at 8:08 PM, Cameron Gandevia <[EMAIL PROTECTED]> wrote:
>> Hi
>>
>> I'm trying to figure out the best way to configure the file channel for
>> maximum throughput and have a couple of questions.
>>
>> 1. What is the best hard disk layout? An SSD for the checkpoint directory
>> and a separate disk for each file channel on the agent?
>
> The checkpoint need not be on SSD. In fact, I'd put the data dirs on
> SSD. That said, I have not tested either with SSD, nor do I know
> anyone who has.
>
> If you have only as many disks as channels, then one disk per channel
> seems fine. If you can afford more disks per channel, put each data
> dir on a different disk for better performance. The channel
> round-robins between disks (though each txn will go to the same data
> dir).
>>
>> 2. Can multiple disks be utilized for a single channel? I could only seem to
>> configure a single data directory.
>
> Yes, multiple disks can be used. In the dataDirs config, pass in a
> comma-separated list of data directories.
>>
>> 3. There is a comment in the documentation that mentions adding more sinks
>> to drain the channel faster. If my final agent sink was hdfs, does that mean
>> configuring two hdfs sinks using a sink group to drain a single channel on
>> an agent? I noticed you can configure thread pools on the hdfs sink but
>> haven't looked into it.
>
> No, in a sink group only one sink is active at any point in time. Sink
> groups are meant for load balancing or failover between the sinks in
> the group. Either use multiple sinks or multiple sink groups.
>
> Adding multiple sinks often helps, since you have more threads doing
> I/O and more sink runners reading from the channel.
>
>>
>> 4. Does it make sense to have my agent run two channels, both with sinks
>> writing to a single hdfs cluster, each configured with a separate data disk,
>> and have the previous agent round-robin deliver to it?
>
> I am not sure why you need 2 channels (remember, a source will
> replicate the events to each channel). Multiple channels are used to
> bifurcate the flow etc. You can use a single channel, with multiple
> disks, and multiple sinks to write to HDFS. You might want to play
> with batch sizes and timeouts on HDFS Sink side - this often affects
> performance a lot. Perhaps I didn't get this question. If I didn't,
> could you clarify?
>
>>
>> Thanks for any input anyone has
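
For reference, the layout suggested in the reply (a single file channel spread over several disks, drained by more than one HDFS sink) could be sketched in an agent configuration roughly as follows. This is only a sketch and is not taken from the thread: the agent name a1, the component names r1, c1, k1 and k2, the Avro source, all paths, the port, and every numeric value are hypothetical placeholders. Property names should be checked against the user guide for the Flume version in use.

```properties
# Hypothetical agent "a1": one source, one file channel, two HDFS sinks.
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Assumed Avro source receiving events from the previous agent.
# Bind address and port are placeholders.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# File channel: dataDirs takes a comma-separated list, ideally one
# directory per physical disk. All paths here are placeholders.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /disk1/flume/checkpoint
a1.channels.c1.dataDirs = /disk2/flume/data,/disk3/flume/data

# Two independent HDFS sinks reading from the same channel.
# They are deliberately NOT placed in a sink group.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.hdfs.filePrefix = events-k1
# Batch size and timeout are example values to tune, not recommendations.
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.callTimeout = 30000

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k2.hdfs.filePrefix = events-k2
a1.sinks.k2.hdfs.batchSize = 1000
a1.sinks.k2.hdfs.callTimeout = 30000
```

Each sink gets its own sink runner, so both drain c1 concurrently. The distinct filePrefix values are an assumption made here so that the two sinks do not produce clashing file names under the same HDFS path.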