Flume, mail # user - Flume netcat source related problems


Jagadish Bihani 2012-09-04, 10:50
Juhani Connolly 2012-09-04, 11:10
Jagadish Bihani 2012-09-05, 06:05
Steve Johnson 2012-09-05, 14:45

Re: Flume netcat source related problems
Juhani Connolly 2012-09-06, 02:23
Would you be able to attach via JMX with jconsole (or similar) and
check the numbers you are getting for events delivered and number of
batches? There are beans exposing these values for the sink, channel,
and source.

If I can, I'll try to recreate your scenario when I have some time, but
that's not happening right now, sorry.

On 09/05/2012 03:05 PM, Jagadish Bihani wrote:
> Hi Juhani
>
> Thanks for the inputs.
> I made the following changes:
> -- I sent my string events to the socket in batches of 1000 and 10000
> events.
> -- I have also started using the DEBUG log level for the Flume agent.
> -- I have also increased the max-line-length property of the netcat
> source from the default of 512.
> But both problems remained. Events got lost without any exception,
> and performance didn't improve much either (from 1 KB/sec it is now
> approx. 1.3 KB/sec).
> Is there anything else to be considered?
>
> Regards,
> Jagadish
>
>
> On 09/04/2012 04:40 PM, Juhani Connolly wrote:
>> Hi Jagadish,
>>
>> NetcatSource doesn't use any batching when receiving events. It
>> writes one event at a time, and that translates in the FileChannel to
>> a flush to disk, so when you're writing many, your disk just won't
>> keep up. One way to improve this is to use separate physical disks
>> for your checkpoint/data directories.
>>
>> TailSource used to have the same problem until we added batching to
>> it. By a cursory examination of NetcatSource, it looks to me like you
>> can also force some batching by sending multiple lines in each
>> socket->send.
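The batching idea above can be sketched roughly as follows: a minimal Python client, assuming the netcat source treats each newline-terminated line as one event. The helper names (`batched`, `encode_batch`, `send_in_batches`) and the default batch size of 100 are hypothetical, chosen only for illustration. The sketch does not read any reply the source may send back.

```python
import socket
from typing import Iterable, Iterator, List


def batched(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size lines."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def encode_batch(batch: List[str]) -> bytes:
    """Join events into one payload with exactly one trailing newline per event."""
    return "".join(line.rstrip("\n") + "\n" for line in batch).encode("utf-8")


def send_in_batches(host: str, port: int, lines: Iterable[str],
                    batch_size: int = 100) -> int:
    """Send lines over a single TCP connection, one sendall() per batch.

    Returns the number of lines handed to the socket.
    """
    sent = 0
    with socket.create_connection((host, port)) as sock:
        for batch in batched(lines, batch_size):
            sock.sendall(encode_batch(batch))
            sent += len(batch)
    return sent
```

With the agent configuration quoted further down, a call might look like `send_in_batches("10.0.17.231", 55355, events, batch_size=1000)`. Whether the source actually processes such a payload as a batch depends on the NetcatSource implementation in use.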
>>
>> As to the first problem with lines going missing, I'm not entirely
>> sure, as I can't dive deeply into the source right now. I wouldn't be
>> surprised if it's some kind of congestion problem combined with a lack
>> of logging (or your log levels are just too high; try switching them
>> to INFO or DEBUG?) that will be resolved once you get the throughput up.
>>
>> On 09/04/2012 07:50 PM, Jagadish Bihani wrote:
>>> Hi
>>>
>>> I encountered a problem in my scenario with the netcat source. The setup is
>>> Host A: Netcat source - file channel - Avro sink
>>> Host B: Avro source - file channel - HDFS sink
>>> But to simplify it I have created a single agent with a "Netcat
>>> Source" and a "file roll sink".
>>> It is:
>>> Host A: Netcat source - file channel - File_roll sink
>>>
>>> *Problem*:
>>> 1. To simulate our production scenario, I have created a script
>>> which runs for 15 sec and, in a while loop, writes requests to the
>>> netcat source on a given port, sleeping between writes. For a large
>>> sleep value, events are delivered correctly to the destination. But
>>> as I reduce the delay, events are given to the source but are not
>>> delivered to the destination. E.g. I wrote 9108 records within
>>> 15 sec using the script and only 1708 got delivered. And I don't get
>>> any exception. If it were a flow-control-related problem, I should
>>> have seen some exception in the agent logs. But with a file channel
>>> and huge disk space, is that even a problem?
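One way to put numbers on the loss described above is to count the lines that reached the sink directory and compare them with the number sent. The following is a minimal sketch, assuming the file_roll sink writes one event per line; the function names `count_delivered` and `delivery_report` are hypothetical.

```python
from pathlib import Path


def count_delivered(sink_dir: str) -> int:
    """Count lines across all regular files directly inside sink_dir."""
    total = 0
    for path in Path(sink_dir).iterdir():
        if path.is_file():
            with path.open("rb") as fh:
                total += sum(1 for _ in fh)
    return total


def delivery_report(sent: int, delivered: int) -> str:
    """Summarise how many of the sent events reached the sink."""
    missing = sent - delivered
    pct = 100.0 * delivered / sent if sent else 0.0
    return (f"sent={sent} delivered={delivered} "
            f"missing={missing} ({pct:.1f}% delivered)")
```

For the figures reported here, `delivery_report(9108, 1708)` gives `sent=9108 delivered=1708 missing=7400 (18.8% delivered)`. The count should be taken only after the agent has had time to drain the channel, otherwise events still in flight are reported as missing.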
>>>
>>> *Machine Configuration:*
>>> RAM : 8 GB
>>> JVM : 200 MB
>>> CPU: 2.0 GHz Quad core processor
>>>
>>> *Flume Agent Configuration*
>>> adServerAgent.sources = netcatSource
>>> adServerAgent.channels = fileChannel memoryChannel
>>> adServerAgent.sinks = fileSink
>>>
>>> # For each one of the sources, the type is defined
>>> adServerAgent.sources.netcatSource.type = netcat
>>> adServerAgent.sources.netcatSource.bind = 10.0.17.231
>>> adServerAgent.sources.netcatSource.port = 55355
>>>
>>> # The channel can be defined as follows.
>>> adServerAgent.sources.netcatSource.channels = fileChannel
>>> #adServerAgent.sources.netcatSource.channels = memoryChannel
>>>
>>> # Each sink's type must be defined
>>> adServerAgent.sinks.fileSink.type = file_roll
>>> adServerAgent.sinks.fileSink.sink.directory = /root/flume/flume_sink
>>>
>>> #Specify the channel the sink should use
>>> #adServerAgent.sinks.fileSink.channel = memoryChannel
>>> adServerAgent.sinks.fileSink.channel = fileChannel
>>>
>>> adServerAgent.channels.memoryChannel.type = memory