Flume >> mail # user >> Of BatchSize / Channel Capacity / Transaction Capacity


Bhaskar V. Karambelkar 2013-01-08, 18:46
Jeff Lord 2013-01-09, 02:40
Bhaskar V. Karambelkar 2013-01-11, 14:48
Jeff Lord 2013-01-11, 17:03
Re: Of BatchSize / Channel Capacity / Transaction Capacity
Published in our wiki:
https://cwiki.apache.org/confluence/display/FLUME/BatchSize,+ChannelCapacity+and+ChannelTransactionCapacity+Properties

- Alex

On Jan 11, 2013, at 6:03 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:

> Bhaskar,
>
> I have created the following jira for this:
> https://issues.apache.org/jira/browse/FLUME-1829
>
> -Jeff
>
>
> On Fri, Jan 11, 2013 at 6:48 AM, Bhaskar V. Karambelkar <[EMAIL PROTECTED]> wrote:
>
>> Thanks Jeff,
>> Clear and detailed explanations. These deserve to be on the wiki, as these
>> parameters have direct implications on the performance of flume nodes.
>>
>> thanks
>> Bhaskar
>>
>>
>> On Tue, Jan 8, 2013 at 9:40 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Bhaskar,
>>>
>>> 1) Batch Size
>>>  1.a) When configured by client code using the flume-core-sdk to send
>>> events to a Flume Avro source.
>>> The Flume client SDK has an appendBatch method, which takes a list of
>>> events and sends them to the source as a batch. The batch size is the
>>> number of events passed to the source at one time.
>>>
>>>  1.b) When set as a parameter on the HDFS sink (or other sinks which support
>>> a BatchSize parameter)
>>> This is the number of events written to the file before it is flushed to HDFS.
>>>
>>> 2)
>>>  2.a) Channel Capacity
>>> This is the maximum number of events the channel can hold.
>>>
>>>  2.b) Channel Transaction Capacity.
>>> This is the maximum number of events the channel will take from a source
>>> or give to a sink in a single transaction.
>>>
>>> How will setting these parameters to different values affect throughput
>>> and latency in event flow?
>>>
>>> In general you will see better throughput with the memory channel than
>>> with the file channel, at the cost of durability.
>>>
>>> The channel capacity needs to be large enough to hold as many events as
>>> upstream agents will add to it. Ideally the sink drains events from the
>>> channel faster than the source adds them.
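A rough way to reason about the sizing advice above is to compare the rate at which the source adds events with the rate at which the sink drains them. The sketch below is back-of-the-envelope arithmetic only; the function name and the idea of counting in fixed "ticks" (for example, seconds) are assumptions made for illustration, not anything Flume provides.

```python
import math


def ticks_until_full(capacity, source_rate, sink_rate):
    """Estimate how many ticks pass before the channel backlog reaches capacity.

    Rates are events per tick. Returns None when the sink keeps up with
    the source, because the backlog then never grows.
    """
    backlog_growth = source_rate - sink_rate
    if backlog_growth <= 0:
        return None  # sink drains at least as fast as the source fills
    return math.ceil(capacity / backlog_growth)


# A source adding 1200 events/tick against a sink draining 1000 events/tick
# grows the backlog by 200 events/tick.
print(ticks_until_full(10000, 1200, 1000))
```

A larger capacity only buys time during bursts; if the source is faster than the sink on average, the channel fills up eventually regardless of its size.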
>>>
>>> The channel transaction capacity will need to be smaller than the channel
>>> capacity.
>>> e.g. If your Channel capacity is set to 10000, then Channel Transaction
>>> Capacity should be set to something like 100.
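To make the relationship between the two limits concrete, here is a toy model of a channel using the capacity of 10000 and transaction capacity of 100 from the example above. This is a simplified sketch, not Flume's implementation: the class `ToyChannel` and its method names are hypothetical, and a real channel also handles concurrency, rollback and durability.

```python
from collections import deque


class ToyChannel:
    """Simplified stand-in for a Flume channel (hypothetical, not Flume code)."""

    def __init__(self, capacity, transaction_capacity):
        if transaction_capacity > capacity:
            raise ValueError("transaction capacity must not exceed capacity")
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self._events = deque()

    def put_batch(self, events):
        """Commit a batch atomically: all events are stored, or none are."""
        if len(events) > self.transaction_capacity:
            raise ValueError("batch larger than transaction capacity")
        if len(self._events) + len(events) > self.capacity:
            raise OverflowError("channel full; the source must retry later")
        self._events.extend(events)

    def take_batch(self, max_events):
        """Remove and return up to max_events, bounded by transaction capacity."""
        n = min(max_events, self.transaction_capacity, len(self._events))
        return [self._events.popleft() for _ in range(n)]

    def __len__(self):
        return len(self._events)


channel = ToyChannel(capacity=10000, transaction_capacity=100)
channel.put_batch(list(range(100)))   # fits in one transaction
print(len(channel))                   # events now waiting for the sink
```

In this model the transaction capacity bounds the size of a single put or take, while the capacity bounds the total backlog that can build up between source and sink.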
>>>
>>> Specifically if we have clients with varying frequency of event
>>> generation, i.e. some clients generating thousands of events/sec, while
>>> others at a much slower rate, what effect will different values of these
>>> params have on these clients ?
>>>
>>> Transaction Capacity is going to be what throttles or limits how many
>>> events the source can put into the channel. This is going to vary depending
>>> on how many tiers of agents/collectors you have set up.
>>> In general though this should probably be equal to whatever you have the
>>> batch size set to in your client.
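The rules of thumb in this reply can be collected into a small sanity check: the client batch size should fit within the transaction capacity, which in turn should be smaller than the channel capacity. The helper below is a hypothetical illustration of that check; it does not read a Flume configuration and its name and messages are made up for this sketch.

```python
def sizing_problems(batch_size, transaction_capacity, capacity):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if batch_size > transaction_capacity:
        problems.append(
            "batch size exceeds transaction capacity: a full batch "
            "cannot be committed in one transaction"
        )
    if transaction_capacity > capacity:
        problems.append(
            "transaction capacity exceeds channel capacity: a full "
            "transaction can never fit in the channel"
        )
    return problems


# Batch size equal to transaction capacity, both well below capacity.
print(sizing_problems(batch_size=100, transaction_capacity=100, capacity=10000))
# A client batch that is too large for the channel's transactions.
print(sizing_problems(batch_size=1000, transaction_capacity=100, capacity=10000))
```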
>>>
>>> With regard to the HDFS batch size, the larger your batch size, the
>>> better performance will be. However, keep in mind that if a transaction
>>> fails, the entire transaction will be replayed, which can produce
>>> duplicate events downstream.
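The duplicate-event effect described above can be seen in a small simulation. This is only a sketch of the replay behaviour, assuming a sink that writes events one at a time and retries the whole batch after a failure; `deliver_with_retry` and its `fail_after` parameter are hypothetical names introduced for illustration.

```python
def deliver_with_retry(batch, destination, fail_after=None):
    """Write a batch to destination; on failure, replay the whole batch.

    fail_after is a hypothetical knob: the first attempt fails once that
    many events have already been written downstream.
    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            for i, event in enumerate(batch):
                if attempts == 1 and fail_after is not None and i == fail_after:
                    raise IOError("simulated write failure")
                destination.append(event)
            return attempts  # commit: the batch would now leave the channel
        except IOError:
            continue  # rollback: the batch stays queued and is replayed


written = []
deliver_with_retry(["a", "b", "c"], written, fail_after=2)
print(written)  # events "a" and "b" appear twice
```

The larger the batch, the more events can already be downstream when a failure forces a replay, so bigger batches trade potential duplicates for throughput.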
>>>
>>> -Jeff
>>>
>>>
>>>
>>>
>>> On Tue, Jan 8, 2013 at 10:46 AM, Bhaskar V. Karambelkar <[EMAIL PROTECTED]> wrote:
>>>
>>>> Can someone explain the importance of the following:
>>>> 1) Batch Size
>>>>  1.a) When configured by client code using the flume-core-sdk to send
>>>> events to a Flume Avro source.
>>>>  1.b) When set as a parameter on HDFS sink (or other sinks which
>>>> support BatchSize parameter)
>>>> 2)
>>>>  2.a) Channel Capacity
>>>>  2.b) Channel Transaction Capacity.
>>>>
>>>>
>>>> Under which conditions should these params be set to high values, and
>>>> under which conditions should they be set to low values.
>>>>
>>>>
>>>> How will setting these parameters to different values, affect
>>>> throughput, latency in event flow.
>>>> Specifically if we have clients with varying frequency of event
>>>> generation, i.e. some clients generating thousands of events/sec, while

Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF