Flume >> mail # user >> Flume netcat source related problems


Flume netcat source related problems
Hi,

I have run into a problem with the netcat source. My actual setup is:
Host A: netcat source - file channel - Avro sink
Host B: Avro source - file channel - HDFS sink
To simplify it, I created a single agent with a netcat source and a
file_roll sink:
Host A: netcat source - file channel - file_roll sink

*Problem 1:*
To simulate our production scenario, I created a script that runs for
15 sec and, in a while loop, writes requests to the netcat source on a
given port. With a large sleep value, events are delivered correctly to
the destination. But as I reduce the delay, events are handed to the
source but are not delivered to the destination. For example, I wrote
9108 records within 15 sec using the script and only 1708 were
delivered. I don't get any exception. If this were a flow-control
problem, I would expect to see some exception in the agent logs. And
with a file channel and plenty of disk space, should this be a problem
at all?
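
The sent-versus-delivered comparison above can be repeated mechanically after each run. What follows is a minimal Python sketch, not part of the original script: it assumes every delivered event ends up as one newline-terminated line in a file directly under the file_roll directory, and the function names `count_delivered` and `delivery_ratio` are hypothetical.

```python
import os


def count_delivered(sink_dir):
    """Count newline-terminated records in the regular files directly under sink_dir."""
    total = 0
    for name in os.listdir(sink_dir):
        path = os.path.join(sink_dir, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories and other non-file entries
        with open(path, "rb") as fh:
            # Read in 1 MiB chunks so large roll files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                total += chunk.count(b"\n")
    return total


def delivery_ratio(sent, delivered):
    """Fraction of sent records that reached the sink (0.0 when nothing was sent)."""
    return delivered / sent if sent else 0.0


if __name__ == "__main__":
    # Figures from the run described above: 9108 sent, 1708 delivered.
    print(f"{delivery_ratio(9108, 1708):.1%} delivered")  # prints "18.8% delivered"
```

Pointing `count_delivered` at the sink directory after the agent has drained the channel gives the delivered count to compare with the number of rows the script reports as sent.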

*Machine configuration:*
RAM: 8 GB
JVM: 200 MB
CPU: 2.0 GHz quad-core processor

*Flume agent configuration:*
adServerAgent.sources = netcatSource
adServerAgent.channels = fileChannel memoryChannel
adServerAgent.sinks = fileSink

# For each one of the sources, the type is defined
adServerAgent.sources.netcatSource.type = netcat
adServerAgent.sources.netcatSource.bind = 10.0.17.231
adServerAgent.sources.netcatSource.port = 55355

# The channel can be defined as follows.
adServerAgent.sources.netcatSource.channels = fileChannel
#adServerAgent.sources.netcatSource.channels = memoryChannel

# Each sink's type must be defined
adServerAgent.sinks.fileSink.type = file_roll
adServerAgent.sinks.fileSink.sink.directory = /root/flume/flume_sink

#Specify the channel the sink should use
#adServerAgent.sinks.fileSink.channel = memoryChannel
adServerAgent.sinks.fileSink.channel = fileChannel

adServerAgent.channels.memoryChannel.type =memory
adServerAgent.channels.memoryChannel.capacity = 100000
adServerAgent.channels.memoryChannel.transactionCapacity = 10000

adServerAgent.channels.fileChannel.type=file
adServerAgent.channels.fileChannel.dataDirs=/root/jagadish/flume_channel1/dataDir3
adServerAgent.channels.fileChannel.checkpointDir=/root/jagadish/flume_channel1/checkpointDir3

*Script snippet being used:*
...
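eval
{
    # Abort the loop after $TIMEOUT seconds via SIGALRM.
    local $SIG{ALRM} = sub { die "alarm\n"; };
    alarm $TIMEOUT;
    my $str = "";
    my $counter = 1;
    while (1)
    {
        # Build one row of $NO_ELE_PER_ROW tab-separated counters.
        $str = "";
        for (my $i = 0; $i < $NO_ELE_PER_ROW; $i++)
        {
            $str .= $counter . "\t";
            $counter++;
        }
        chop($str);    # drop the trailing tab
        #print $socket "$str\n";
        # send() returns undef on failure, so a failed write dies here.
        $socket->send($str . "\n") or die "Didn't send";

        # Note: $? holds the status of the last child process, not of
        # send(), so this check does not detect a failed send.
        if ($? != 0)
        {
            print "Failed for $str \n";
        }
        print "$str\n";
        Time::HiRes::usleep($SLEEP_TIME);    # delay in microseconds
    }
    alarm 0;
};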
if ($@) {
......

- The script itself works: with a very large delay, all events are
transmitted correctly.
- The same problem occurs with the memory channel too, but at lower
sleep values.
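
For readers who do not use Perl, the sender loop can be sketched in Python. This is an illustrative equivalent under stated assumptions, not the script that produced the numbers in this post: `build_row` and `send_rows` are hypothetical names, and the sleep and row-width values in the usage comment are made up. It uses `sendall`, so each row is either written in full or an exception is raised.

```python
import socket
import time


def build_row(counter, n_per_row):
    """Return (row, next_counter): n_per_row tab-separated consecutive integers."""
    values = range(counter, counter + n_per_row)
    return "\t".join(str(v) for v in values), counter + n_per_row


def send_rows(sock, duration_s, sleep_s, n_per_row):
    """Send newline-terminated rows for duration_s seconds; return rows sent."""
    deadline = time.monotonic() + duration_s
    counter, sent = 1, 0
    while time.monotonic() < deadline:
        row, counter = build_row(counter, n_per_row)
        sock.sendall((row + "\n").encode("ascii"))  # raises OSError on failure
        sent += 1
        time.sleep(sleep_s)
    return sent


# Example use. Host and port are the netcat source settings from the agent
# configuration; sleep_s and n_per_row are illustrative assumptions.
#   with socket.create_connection(("10.0.17.231", 55355)) as s:
#       print("rows sent:", send_rows(s, duration_s=15, sleep_s=0.001, n_per_row=10))
```

The return value of `send_rows` is the "sent" figure to compare against the number of lines that reach the sink.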

*Problem 2:*
-- With this setup I get very low throughput: I can transfer only
~1 KB/sec of data to the destination file sink. The HDFS sink gave
similar performance.
-- I tried increasing batch sizes in my original scenario without much
gain in throughput.
-- Using 'tail -F' as the source, I saw almost 10 times better throughput.
-- Is there any tunable parameter for the netcat source?
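
A throughput figure like the one above is easiest to compare across runs when it is measured the same way each time. Here is a minimal Python sketch, assuming throughput means the bytes written directly under the sink directory divided by the wall-clock duration of the run. The function names are hypothetical.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of the regular files directly inside path."""
    with os.scandir(path) as entries:
        return sum(e.stat().st_size for e in entries if e.is_file())


def throughput_kb_per_sec(total_bytes, elapsed_s):
    """Average throughput in KB/sec (1 KB = 1024 bytes)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return total_bytes / 1024.0 / elapsed_s
```

For example, 15360 bytes delivered during a 15 sec run works out to 1.0 KB/sec.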

Please help me with the two points above: i) netcat source use cases,
and ii) the throughput typically expected from Flume with a file channel
and a file/HDFS sink on a single machine.

Regards,
Jagadish
+
Juhani Connolly 2012-09-04, 11:10
+
Jagadish Bihani 2012-09-05, 06:05
+
Steve Johnson 2012-09-05, 14:45
+
Juhani Connolly 2012-09-06, 02:23