Flume, mail # user - flume tail source problem and performance


flume tail source problem and performance
周梦想 2013-01-29, 07:24
Hello,
1. I want to tail a log file and write it to HDFS. My configuration is below:
config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
agentDFOSink("hadoop48",35853) ;]
config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
agentDFOSink("hadoop48",35853) ;]
config [co1, collectorSource( 35853 ),  [collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
I found that if I restart the agent node, it resends the content of game.log
to the collector. Is there a way to resume sending from the point where the
agent left off? Or do I have to keep a marker myself, or remove the logs
manually, whenever I restart the agent node?
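
To make the "keep a marker myself" option concrete, here is a minimal sketch in Python of what such a marker could look like. It is not a Flume feature; the function name `read_new_lines` and the checkpoint file are hypothetical, and it assumes the log is only ever appended to (or truncated on rotation).

```python
import os

def read_new_lines(log_path, offset_path):
    """Return complete lines appended to log_path since the last call.

    The byte offset of the last consumed line is stored in offset_path
    (a hypothetical checkpoint file, not something Flume maintains).
    """
    # Load the last saved byte offset; start from 0 if there is no checkpoint.
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            text = f.read().strip()
            offset = int(text) if text else 0

    # If the log was truncated or rotated, the saved offset is past the end.
    if offset > os.path.getsize(log_path):
        offset = 0

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()

    # Consume only complete lines; a partial last line waits for the next call.
    last_newline = data.rfind(b"\n")
    if last_newline == -1:
        return []
    complete = data[: last_newline + 1]

    # Write the new offset to a temp file and rename it, so a crash cannot
    # leave a half-written checkpoint behind.
    tmp_path = offset_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(offset + len(complete)))
    os.replace(tmp_path, offset_path)

    return complete.decode("utf-8", errors="replace").splitlines()
```

In this sketch the offset is saved as soon as the lines are read. A real agent would more likely save it only after the collector has acknowledged the data, so that a crash causes a resend rather than a loss.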

2. I tested Flume's performance and found it a bit slow.
With the configuration above, throughput is only about 50MB/minute.
I changed the configuration to the following:
ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
agentDFOSink("hadoop48",35853);

config [co1, collectorSource( 35853 ), [collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]

I sent a 300MB log and it took about 3 minutes, so throughput is about 100MB/minute.

When I send the log from ag1 to co1 via scp, I get about 30MB/second.
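
To put these figures in the same unit, here is a small Python sketch of the arithmetic. It uses only the rough measurements quoted in this message, and the helper name `mb_per_second` is just for illustration.

```python
def mb_per_second(megabytes, seconds):
    """Convert a transfer of `megabytes` taking `seconds` into MB/s."""
    return megabytes / seconds

# Rough measurements from the tests described above.
flume_plain = mb_per_second(50, 60)      # first config: about 50MB per minute
flume_batched = mb_per_second(300, 180)  # batch + gzip: 300MB in about 3 minutes
scp = 30.0                               # scp between the same hosts, in MB/s

for name, rate in [("flume (plain)", flume_plain),
                   ("flume (batch + gzip)", flume_batched),
                   ("scp", scp)]:
    print(f"{name}: {rate:.2f} MB/s, scp is {scp / rate:.0f}x this rate")
```

So even with batching and compression, Flume runs at under 2MB/second here, roughly 18 times slower than scp over the same link.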

Does anyone have any ideas?

Thanks!

Andy