Bhaskar V. Karambelkar 2013-01-08, 18:46
Jeff Lord 2013-01-09, 02:40
Bhaskar V. Karambelkar 2013-01-11, 14:48
Jeff Lord 2013-01-11, 17:03
Re: Of BatchSize / Channel Capacity / Transaction Capacity
Published in our wiki:
On Jan 11, 2013, at 6:03 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
> I have created the following jira for this:
> On Fri, Jan 11, 2013 at 6:48 AM, Bhaskar V. Karambelkar <[EMAIL PROTECTED]> wrote:
>> Thanks Jeff,
>> Clear and detailed explanations. These deserve to be on the wiki, as these
>> parameters have direct implications on the performance of flume nodes.
>> On Tue, Jan 8, 2013 at 9:40 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
>>> Hi Bhaskar,
>>> 1) Batch Size
>>> 1.a) When configured by client code using the flume-core-sdk, to send
>>> events to the flume avro source.
>>> The flume client sdk has an appendBatch method. This takes a list of
>>> events and sends them to the source as a single batch. The batch size here
>>> is the number of events passed to the source at one time.
>>> 1.b) When set as a parameter on HDFS sink (or other sinks which support
>>> BatchSize parameter)
>>> This is the number of events written to the file before it is flushed to HDFS.
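For 1.b, a hedged example of what that looks like in an agent's configuration file; the agent, sink, and channel names (a1, k1, c1) and the HDFS path are illustrative, not from the thread:

    # HDFS sink: hdfs.batchSize events are written to the file before a flush to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events
    a1.sinks.k1.hdfs.batchSize = 1000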
>>> 2.a) Channel Capacity
>>> This is the maximum number of events the channel can hold.
>>> 2.b) Channel Transaction Capacity.
>>> This is the maximum number of events the channel can accept from a source
>>> or deliver to a sink in a single transaction.
>>> How will setting these parameters to different values affect throughput
>>> and latency in the event flow?
>>> In general you will see better throughput by using the memory channel as
>>> opposed to the file channel, at the cost of durability.
>>> The channel capacity needs to be sized large enough to hold as many events
>>> as will be added to it by upstream agents.
>>> In an ideal flow the sink drains events from the channel faster than the
>>> source adds them.
>>> The channel transaction capacity will need to be smaller than the channel
>>> capacity, e.g. if your channel capacity is set to 10000, then the channel
>>> transaction capacity should be set to something like 100.
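As a sketch of that sizing rule in an agent's configuration file, using the numbers from the example above (the agent and channel names a1 and c1 are illustrative):

    a1.channels.c1.type = memory
    # capacity: maximum number of events the channel can hold
    a1.channels.c1.capacity = 10000
    # transactionCapacity: maximum events per put/take transaction; keep it well
    # below capacity and roughly in line with the client/sink batch size
    a1.channels.c1.transactionCapacity = 100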
>>> Specifically if we have clients with varying frequency of event
>>> generation, i.e. some clients generating thousands of events/sec, while
>>> others at a much slower rate, what effect will different values of these
>>> params have on these clients?
>>> Transaction Capacity is going to be what throttles or limits how many
>>> events the source can put into the channel. This is going to vary depending
>>> on how many tiers of agents/collectors you have set up.
>>> In general though this should probably be equal to whatever you have the
>>> batch size set to in your client.
>>> With regard to the hdfs batch size, the larger your batch size, the better
>>> your performance will be. However, keep in mind that if a transaction fails,
>>> the entire transaction will be replayed, which can result in duplicate
>>> events downstream.
>>> On Tue, Jan 8, 2013 at 10:46 AM, Bhaskar V. Karambelkar <
>>> [EMAIL PROTECTED]> wrote:
>>>> Can someone explain the importance of the following
>>>> 1) Batch Size
>>>> 1.a) When configured by client code using the flume-core-sdk, to send
>>>> events to the flume avro source.
>>>> 1.b) When set as a parameter on HDFS sink (or other sinks which
>>>> support BatchSize parameter)
>>>> 2.a) Channel Capacity
>>>> 2.b) Channel Transaction Capacity.
>>>> Under which conditions should these params be set to high values, and
>>>> under which conditions should they be set to low values?
>>>> How will setting these parameters to different values affect
>>>> throughput and latency in the event flow?
>>>> Specifically if we have clients with varying frequency of event
>>>> generation, i.e. some clients generating thousands of events/sec, while
>>>> others at a much slower rate, what effect will different values of these
>>>> params have on these clients?