Flume, mail # user - flume tail source problem and performance


Re: flume tail source problem and performance
Alexander Alten-Lorenz 2013-02-04, 07:39
Hi Andy,

I meant more of an own program/script to parse the data (instead of tail -*), so that you have some control over the contents. Note that when a Flume agent is restarted, the marker for tail is lost as well. This comes from tail itself; Flume has no control over it.
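A minimal sketch of such a script, keeping its own byte-offset marker so a restart resumes where it left off instead of resending the whole file (the file paths and function name here are just illustrative, not part of Flume):

```python
# Hypothetical stand-in for "your own script" instead of tail -F:
# read only the lines appended since the last run, and persist the
# byte offset in a marker file so a restart does not resend old data.
import os

def read_new_lines(log_path, marker_path):
    """Read lines appended since the last run; persist the byte offset."""
    offset = 0
    if os.path.exists(marker_path):
        with open(marker_path) as m:
            offset = int(m.read().strip() or 0)
    with open(log_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if offset > size:          # file truncated or rotated: start over
            offset = 0
        f.seek(offset)
        lines = f.read().decode("utf-8", "replace").splitlines()
        new_offset = f.tell()
    with open(marker_path, "w") as m:
        m.write(str(new_offset))
    return lines
```

Something like this could feed Flume via an exec or pipe, and the marker survives agent restarts, which tail cannot give you.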

- Alex

On Feb 4, 2013, at 8:33 AM, 周梦想 <[EMAIL PROTECTED]> wrote:

> Hi Alex,
>
> Do you mean I should write a script to check the directories?
> [zhouhh@Hadoop46 ag1]$ pwd
> /tmp/flume-zhouhh/agent/ag1
> [zhouhh@Hadoop46 ag1]$ ls
> dfo_error  dfo_import  dfo_logged  dfo_sending  dfo_writing  done  error
> import  logged  sending  sent  writing
>
> How can I check these to avoid losing data and to prevent resending data? Should I clean the sending dir?
>
> thanks!
> Andy
>
> 2013/1/29 Alexander Alten-Lorenz <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> You could use tail -F, but this depends on the external source; Flume
>> has no control over it. You can write your own script and include this.
>>
>> What's the content of:
>> /tmp/flume/agent/agent*.*/ - directories? Are sent and sending clean?
>>
>> - Alex
>>
>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>> 1. I want to tail a log source and write it to HDFS. Below is my configuration:
>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>> agentDFOSink("hadoop48",35853) ;]
>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>> agentDFOSink("hadoop48",35853) ;]
>>> config [co1, collectorSource( 35853 ),  [collectorSink(
>>>
>> "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>
>>>
>>> I found that if I restart the agent node, it resends the content of
>>> game.log to the collector. Is there a way to send logs only from where I
>>> left off? Or do I have to keep a marker myself, or remove the logs
>>> manually, when restarting the agent node?
>>>
>>> 2. I tested the performance of Flume and found it a bit slow.
>>> With the configuration above, I get only 50MB/minute.
>>> I changed the configuration to the following:
>>> ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
>>> agentDFOSink("hadoop48",35853);
>>>
>>> config [co1, collectorSource( 35853 ), [collectorSink(
>>>
>> "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
>>> "hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>
>>> I sent a 300MB log and it took about 3 minutes, so that's about
>>> 100MB/minute.
>>>
>>> Meanwhile, sending the log from ag1 to co1 via scp runs at about 30MB/second.
>>>
>>> Can anyone give me any ideas?
>>>
>>> thanks!
>>>
>>> Andy
>>
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>>
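To check the DFO directories Andy listed, a minimal sketch (assuming the directory layout from his `ls` output; the function name and the choice of subdirectories are just illustrative) could be:

```python
# Hedged sketch: count leftover files in the agent's DFO directories.
# Non-empty "sending"/"sent" directories suggest a restarted agent
# may resend those events; directory names follow Andy's listing.
import os

def dfo_status(agent_dir,
               subdirs=("sending", "sent", "dfo_sending", "dfo_writing")):
    """Return {subdir: number of files} for the given Flume agent dir."""
    status = {}
    for d in subdirs:
        path = os.path.join(agent_dir, d)
        status[d] = len(os.listdir(path)) if os.path.isdir(path) else 0
    return status

# e.g. dfo_status("/tmp/flume-zhouhh/agent/ag1")
```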

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF