Flume >> mail # user >> Need for UDP / Multicast Source

Re: Need for UDP / Multicast Source
Ok, I'm still struggling with this a bit.  Here's what I've currently got going.

In order to make it easier to check what I am and am not receiving, I've narrowed the logs that I store in HDFS down to those originating from a single host (cp1044.wikimedia.org).  Each host generates contiguous sequence numbers for each log line.  I can use the sequence number to make sure I'm not missing lines from a host.

On another nearby node, I started a process to store all of the log lines originating from cp1044.  I then started the Flume agent and waited 3 minutes for it to roll files 3 times.  I currently have 4 HDFS sinks going, so this created a total of 12 files.  I got the files out of HDFS, and then sorted them on their sequence numbers to find the first and last sequence number in this set of files.

I took those two border sequence numbers and, from the capture on the nearby host (which does not use Flume), extracted all of the log lines generated by cp1044 that fall between them.  I should be able to compare the number of lines here with the number of lines in the 12 files I extracted from HDFS and Flume.  If they are the same, then Flume and UDPSource are working!
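For anyone wanting to repeat the check, here is a minimal Python sketch of the comparison described above. It is only an illustration, not the commands actually used: the function names are hypothetical, and it assumes each log line is whitespace-separated with the originating host in the first field and the sequence number in the second, which may not match the real log format.

```python
def parse_seq(line, host="cp1044.wikimedia.org", host_field=0, seq_field=1):
    """Return the sequence number of a line from `host`, or None.

    ASSUMPTION: whitespace-separated fields, host in field 0 and
    sequence number in field 1. Adjust the indices for the real format.
    """
    fields = line.split()
    if len(fields) <= max(host_field, seq_field):
        return None
    if fields[host_field] != host:
        return None
    try:
        return int(fields[seq_field])
    except ValueError:
        return None


def compare_captures(flume_lines, reference_lines):
    """Compare lines pulled from HDFS with the non-Flume reference capture."""
    flume_seqs = {s for s in map(parse_seq, flume_lines) if s is not None}
    if not flume_seqs:
        return None

    # Border sequence numbers come from the Flume/HDFS files.
    first, last = min(flume_seqs), max(flume_seqs)

    # Keep only reference lines that fall inside those borders.
    ref_seqs = {
        s for s in map(parse_seq, reference_lines)
        if s is not None and first <= s <= last
    }

    return {
        "first": first,
        "last": last,
        "flume_count": len(flume_seqs),
        "reference_count": len(ref_seqs),
        "missing_from_flume": sorted(ref_seqs - flume_seqs),
    }
```

If `flume_count` equals `reference_count` and `missing_from_flume` is empty, nothing was dropped between the two border sequence numbers.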

Flume saved 19451 events to HDFS, and the number of raw events recorded outside of Flume and HDFS was 78176.  I'm up to about 25% of the data!  Better but still not good enough. :(

This was for about 3 minutes of data, so for a single host, this shouldn't be more than 500 events per second.  I must be doing something really wrong on the Flume tweaky side of things, eh?  Any more ideas?
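The figures above can be sanity-checked with a few lines of Python. This is just arithmetic on the two counts quoted in this message; the 180-second window is an assumption taken from "about 3 minutes", so the rate is approximate.

```python
flume_events = 19451       # events Flume wrote to HDFS
reference_events = 78176   # events recorded outside of Flume
window_seconds = 3 * 60    # ASSUMPTION: "about 3 minutes" taken as 180 s

# Fraction of the reference events that made it through Flume.
capture_ratio = flume_events / reference_events

# Average event rate from the single host over the window.
reference_rate = reference_events / window_seconds

print(f"captured {capture_ratio:.1%} of events")
print(f"reference rate: {reference_rate:.0f} events/s")
```

Under that assumption the reference rate comes out a little below the 500 events per second ceiling mentioned above.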


P.S.  YOU GUYS ARE SO HELPFUL.  Thanks so much for everything thus far.
On Jan 17, 2013, at 10:34 AM, Andrew Otto <[EMAIL PROTECTED]> wrote:

>> with UDP there's no guarantee that the data will reach its destination.
> True, but I'm experimenting with using Flume as a replacement for a system that is already in place.  I actually got the numbers I listed below by grabbing data directly off of the UDP stream and saving it to a file on local disk.  It's possible that UDP data is getting lost in the network somewhere, but if that were the case I wouldn't know about it.  I am comparing Flume's performance to a single process writing to a local disk.