Flume user mailing list: flume tail source problem and performance


Thread:
周梦想 2013-01-29, 07:24
Alexander Alten-Lorenz 2013-01-29, 07:29
Jeong-shik Jang 2013-01-29, 07:41
周梦想 2013-02-04, 07:27
Jeong-shik Jang 2013-02-04, 07:47
周梦想 2013-02-04, 08:07
Jeong-shik Jang 2013-02-04, 08:13
GuoWei 2013-02-04, 11:46
周梦想 2013-02-06, 02:47
周梦想 2013-02-04, 07:33
Re: flume tail source problem and performance
Hi Andy,

I meant writing your own program or script to parse the data (instead of tail -*), so you have some control over the contents. Note that when a Flume agent is restarted, the marker for tail is lost as well. That comes from tail itself; Flume has no control over it.

- Alex
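
A minimal sketch of the kind of offset-keeping script described above. This is illustrative only: the offset file and the deliver callback are assumptions for the sketch, not Flume features, and rotation handling is reduced to a truncation check.

```python
import os

def tail_with_offset(log_path, offset_path, deliver):
    """Read new lines from log_path, resuming from a persisted byte offset.

    deliver(line) is a placeholder for shipping a line downstream (e.g. to a
    Flume source); offset_path stores the last byte position read, so a
    restart resumes instead of resending the whole file.
    """
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)

    # Binary mode so tell()/seek() work reliably while reading line by line.
    with open(log_path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if offset > size:
            # File was truncated or rotated: start over from the beginning.
            offset = 0
        f.seek(offset)
        while True:
            raw = f.readline()
            if not raw:
                break
            deliver(raw.decode().rstrip("\n"))
        # Persist the new position so a restart does not resend old data.
        with open(offset_path, "w") as out:
            out.write(str(f.tell()))
```

Run it periodically (or in a loop): only lines written since the last persisted offset are delivered on each pass.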

On Feb 4, 2013, at 8:33 AM, 周梦想 <[EMAIL PROTECTED]> wrote:

> Hi Alex,
>
> You mean I write a script to check the directories?
> [zhouhh@Hadoop46 ag1]$ pwd
> /tmp/flume-zhouhh/agent/ag1
> [zhouhh@Hadoop46 ag1]$ ls
> dfo_error  dfo_import  dfo_logged  dfo_sending  dfo_writing  done  error
> import  logged  sending  sent  writing
>
> How should I check these to avoid losing data while preventing resends? Clean the sending dir?
>
> thanks!
> Andy
>
> 2013/1/29 Alexander Alten-Lorenz <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> you could use tail -F, but that depends on the external source, which
>> Flume has no control over. You can write your own script and use that instead.
>>
>> What's the content of the /tmp/flume/agent/agent*.*/ directories?
>> Are sent and sending clean?
>>
>> - Alex
>>
>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>> 1. I want to tail a log source and write it to HDFS. Below is the configuration:
>>> config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>   agentDFOSink("hadoop48",35853);]
>>> config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
>>>   agentDFOSink("hadoop48",35853);]
>>> config [co1, collectorSource(35853), [collectorSink(
>>>   "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),
>>>   collectorSink("hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>
>>>
>>> I found that if I restart the agent node, it resends the whole content
>>> of game.log to the collector. Is there a way to send only the logs that
>>> haven't been sent yet? Or do I have to keep a marker myself, or remove
>>> the logs manually, whenever I restart the agent node?
>>>
>>> 2. I tested the performance of Flume and found it a bit slow.
>>> With the configuration above, I get only 50MB/minute.
>>> I changed the configuration to the following:
>>> ag1: tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
>>>   agentDFOSink("hadoop48",35853);
>>>
>>> config [co1, collectorSource(35853), [collectorSink(
>>>   "hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),
>>>   collectorSink("hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]
>>>
>>> Sending 300MB of logs took about 3 minutes, so that's about 100MB/minute.
>>>
>>> Meanwhile, sending the log from ag1 to co1 via scp runs at about
>>> 30MB/second.
>>>
>>> Can anyone give me some ideas?
>>>
>>> thanks!
>>>
>>> Andy
>>
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>>

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF
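
Regarding the directory question in the quoted thread, a minimal sketch of an automated check, assuming (as the listing above suggests) that not-yet-acknowledged events sit in the writing/sending/logged directories and their dfo_ counterparts. The directory names are taken from Andy's listing; the function itself is illustrative, not part of Flume.

```python
import os

# Directories assumed, for this sketch, to hold events not yet acknowledged
# by the collector; if any is non-empty, a restart may resend or lose data.
PENDING_DIRS = ["writing", "sending", "logged",
                "dfo_writing", "dfo_sending", "dfo_logged"]

def pending_events(agent_dir):
    """Return the names of pending directories that still contain files."""
    busy = []
    for name in PENDING_DIRS:
        path = os.path.join(agent_dir, name)
        if os.path.isdir(path) and os.listdir(path):
            busy.append(name)
    return busy
```

If pending_events("/tmp/flume-zhouhh/agent/ag1") returns an empty list, the sending/sent state is "clean" in the sense Alex asks about, and a restart should not resend data.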