-
HDFS SINK Performance - Shara Shi, 2012-08-27, 09:26
Hi All,

No matter how I tune the parameters of the HDFS sink, I can't get throughput higher than 20 MB per minute.
Is that normal? It seems strange to me.
How can I improve it?

Regards,
Ruihong Shi

=========================================
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Define a memory channel called ch1 on collector1
collector2.channels.ch2.type = memory
collector2.channels.ch2.capacity = 500000
collector2.channels.ch2.keep-alive = 1

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
collector2.sources.avro-source1.channels = ch2
collector2.sources.avro-source1.type = avro
collector2.sources.avro-source1.bind = 0.0.0.0
collector2.sources.avro-source1.port = 41415
collector2.sources.avro-soruce1.threads = 10

# Define a hdfs sink
collector2.sinks.hdfs.channel = ch2
collector2.sinks.hdfs.type = hdfs
collector2.sinks.hdfs.hdfs.path = hdfs://namenode:8020/user/root/flume/webdata/exec/%Y/%m/%d/%H
collector2.sinks.hdfs.batchsize = 50000
collector2.sinks.hdfs.runner.type = polling
collector2.sinks.hdfs.runner.polling.interval = 1
collector2.sinks.hdfs.hdfs.rollInterval = 120
collector2.sinks.hdfs.hdfs.rollSize = 0
collector2.sinks.hdfs.hdfs.rollCount = 300000
collector2.sinks.hdfs.hdfs.fileType = DataStream
collector2.sinks.hdfs.hdfs.round = true
collector2.sinks.hdfs.hdfs.roundValue = 10
collector2.sinks.hdfs.hdfs.roundUnit = minute
collector2.sinks.hdfs.hdfs.threadsPoolSize = 10
collector2.sinks.hdfs.hdfs.rollTimerPoolSize = 10

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
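
One detail in the posted configuration is worth a second look: the `threads` line names the component `avro-soruce1`, while every other source line uses `avro-source1`. The fragment below is only a sketch of the presumably intended spelling. Whether the Avro source in this Flume version honors a `threads` setting at all is an assumption that nothing in the thread confirms.

```properties
# As posted, the component name is misspelled ("soruce"), so the setting would
# not attach to the source defined by the other avro-source1 lines:
# collector2.sources.avro-soruce1.threads = 10

# Assumed intent, using the component name from the rest of the post:
collector2.sources.avro-source1.threads = 10
```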
-
Re: HDFS SINK Performance - Denny Ye, 2012-08-27, 12:04
Hi Shara,

You are using MemoryChannel as the repository. In my tests with locally updated code it reached 45 MB/sec without a full GC. Is that your goal, or do you need higher throughput?

-Regards
Denny Ye
-
Re: HDFS SINK Performance - Shara Shi, 2012-08-28, 01:59
Hi Denny,

A throughput of 45 MB/sec would be fine for me, but I only get 20 MB per minute.
What's wrong with my configuration?

Regards,
Shara
-
Re: Re: HDFS SINK Performance - Denny Ye, 2012-08-28, 03:02
20 MB/min or 20 MB/sec?
I suspect the unit may have been stated by mistake. Can you confirm it?

-Regards
Denny Ye
-
Re: Re: HDFS SINK Performance - Shara Shi, 2012-08-28, 03:19
Hi Denny,

It is 20 MB/min, I have confirmed it.
I sent data with avro-client from my local machine to the Flume agent, and I really got 20 MB/min.
So I am trying to find out why.

Regards,
Shara
-
Re: Re: Re: HDFS SINK Performance - Mohit Anchlia, 2012-08-28, 04:48
Do you get better performance when you write directly to the cluster? Can you run some tests writing to the cluster directly and compare?
-
Re: Re: Re: HDFS SINK Performance - Shara Shi, 2012-08-28, 05:08
Hi Anchlia,

If I use hadoop fs -put xxx xxx, the performance is OK, much faster than Flume's.

Regards,
Shara
-
Re: Re: Re: HDFS SINK Performance - Patrick Wendell, 2012-08-28, 05:11
Hey,

Can you let us know at what rate data is arriving at collector2? How many events/second and bytes/second, roughly?

Also, why is your batch size so large? I'm not sure, but I think the sink may wait until it has received batchSize events before it decides to flush them to HDFS, so this may create strange results depending on how many events/second you have.

- Patrick
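
A smaller batch, as Patrick hints, might look like the sketch below. It assumes the key names documented in the Flume user guide (`hdfs.batchSize` for the HDFS sink, `transactionCapacity` for the memory channel) and assumes that a sink batch has to fit inside one channel transaction. The value 1000 is purely illustrative and does not come from the thread.

```properties
# Memory channel: capacity is taken from the posted config;
# transactionCapacity is an assumed addition, not part of the original post.
collector2.channels.ch2.type = memory
collector2.channels.ch2.capacity = 500000
collector2.channels.ch2.transactionCapacity = 1000

# HDFS sink: a batch no larger than the channel's transactionCapacity.
# The value is a hypothetical starting point to tune from, not a recommendation.
collector2.sinks.hdfs.hdfs.batchSize = 1000
```

With a bounded batch, the sink would flush after at most that many events, which makes the relation between the arrival rate (events/second) and the flush frequency easier to reason about.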
-
Re: Re: Re: HDFS SINK Performance - Shara Shi, 2012-08-28, 05:42
Hi Patrick,

I am trying to send a data file of more than 200 MB via the Flume avro-client to a Flume agent with an HDFS sink.
I think most of the events are in the (memory) channel, but flushing them to HDFS (disk) is very slow.
If I use hadoop fs -put xxx xxx, the performance is OK; it takes just several seconds.

My events are large, over 1 KB each.
I use flume-1.2.0 and my Hadoop cluster is CDH4.

Regards,
Shara
-
Re: Re: Re: Re: HDFS SINK Performance - Brock Noland, 2012-08-28, 11:47
Do you have a batch size configured for HDFSSink?

Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
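
Brock's question points at the spelling of the batch size key in the posted configuration. The comparison below is a sketch, assuming the HDFS sink reads its batch size from `hdfs.batchSize` as documented in the Flume user guide. If that assumption holds, the posted `batchsize` line (no `hdfs.` prefix, lowercase "s") may never have reached the sink, which would then run with its default batch size.

```properties
# As posted: possibly not recognized by the HDFS sink.
# collector2.sinks.hdfs.batchsize = 50000

# Spelling documented in the Flume user guide. The value is kept from the post
# for comparison only; it is not a tuning recommendation.
collector2.sinks.hdfs.hdfs.batchSize = 50000
```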