I tried a proof of concept with the Netcat source on 1.3.0 or 1.3.1 and it failed miserably. I was able to make a change that improved its performance, arguably a bug fix (I think it involved a socket acknowledgement it was expecting), but the Netcat source was still my bottleneck.
Have you read the blog posts on performance tuning? I'm not sure where you are in your Flume implementation, but I found them helpful: https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 and https://blogs.apache.org/flume/entry/apache_flume_filechannel
Since you need persistent storage, I believe your only option is still the file channel. To get the performance you need, you'll need dedicated disks for the queue and the write-ahead log. I had good luck with a solid state drive; with a single disk drive, performance was awful.
To get the throughput I wanted with compression I had one source tied to 6 file channels with compression on each channel. Perhaps there is a better way but that is how I got it working.
We also configured Forced Write Back on the CentOS boxes serving as Flume agents. That was an optimization our IT Operations team made, and it helped throughput. It is a skill I don't have, but I believe it does put you at risk of data loss if the server fails, because it does more caching before flushing to disk.
We are currently fluming between 40 and 50 billion log lines per day (10-12 TB of data) from 14 servers in the "collector tier", sinking the data to 8 servers in the "storage tier" that write to HDFS (MapR's implementation) without problems. We had no problem with half the servers; however, we configured failover and paired up the servers for that purpose. Failover, by the way, works flawlessly: we can pull one server out for maintenance and add it back in with no problem.
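For a rough sense of scale, the figures above can be turned into per-second rates. This is only back-of-envelope arithmetic: it assumes the load is spread evenly over the 14 collectors, that 1 TB means 10^12 bytes, and that the low and high ends of the two ranges pair up (40 billion lines with 10 TB, 50 billion with 12 TB). None of those assumptions is stated in the message.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(lines_per_day, servers=1):
    """Average lines per second, optionally divided evenly across servers."""
    return lines_per_day / SECONDS_PER_DAY / servers


def avg_line_bytes(bytes_per_day, lines_per_day):
    """Average size of one log line in bytes."""
    return bytes_per_day / lines_per_day


# (lines per day, terabytes per day); the pairing is an assumption
for lines, terabytes in ((40e9, 10), (50e9, 12)):
    total = per_second(lines)
    each = per_second(lines, servers=14)  # 14 collector-tier servers
    size = avg_line_bytes(terabytes * 1e12, lines)
    print(f"{lines:.0e} lines/day: {total:,.0f}/s total, "
          f"{each:,.0f}/s per collector, about {size:.0f} bytes per line")
```

Under those assumptions the average line comes out at a few hundred bytes, which is in the same range as the message size described in the original question.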
Here are some high-level points about our implementation.
1. Instead of the netcat source I used the Embedded Agent. When I created an event for Flume (EventBuilder.withBody(payload, hdrs)) I put a configurable number of log lines in the payload, usually 200 lines of log data. Ultimately I moved away from text data altogether and serialized 200 Avro "log objects" as an Avro data file byte array, and that was my payload.
2. Keep your batch size large. I set mine to 50, so 10,000 log lines (or objects) in a single batch.
3. You will get duplicates, so be prepared to either customize Flume to prevent duplicates (our solution) or write MapReduce jobs to remove them.
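The packing and batching arithmetic in the points above can be sketched in plain Python. This is an illustration only, not the Embedded Agent code described in the message: the names pack_events, make_batches and drop_duplicates are hypothetical, payloads are newline-joined text rather than Avro, and the duplicate filter is a simple in-memory set of content hashes, which is an assumption and not the Flume customization mentioned in point 3.

```python
import hashlib

LINES_PER_EVENT = 200   # log lines packed into one event body (point 1)
EVENTS_PER_BATCH = 50   # events sent together in one batch (point 2)


def chunks(items, size):
    """Yield consecutive slices of items holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pack_events(lines):
    """Join every LINES_PER_EVENT lines into one newline-delimited payload."""
    return ["\n".join(group).encode("utf-8")
            for group in chunks(lines, LINES_PER_EVENT)]


def make_batches(events):
    """Group event payloads into batches of EVENTS_PER_BATCH."""
    return list(chunks(events, EVENTS_PER_BATCH))


def drop_duplicates(events, seen=None):
    """Keep the first occurrence of each payload, keyed by SHA-256 digest."""
    seen = set() if seen is None else seen
    unique = []
    for body in events:
        digest = hashlib.sha256(body).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique


lines = [f"log line {i}" for i in range(25_000)]
events = pack_events(lines)
batches = make_batches(events)
# A full batch carries EVENTS_PER_BATCH * LINES_PER_EVENT log lines.
print(len(events), len(batches), len(batches[0]) * LINES_PER_EVENT)
```

A full batch here holds 50 events of 200 lines each, matching the 10,000 lines per batch in point 2. The hash-based filter only catches byte-identical payloads, and a real deployment would need to bound or persist the set of seen digests.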
From: Andrew Ehrlich [[EMAIL PROTECTED]]
Sent: Thursday, March 27, 2014 1:07 PM
To: [EMAIL PROTECTED]
Subject: Re: Fastest way to get data into flume?
What about having more than one flume agent?
You could have two agents that read the small messages and sink to HDFS, or two agents that read the messages, serialize them, and send them to a third agent which sinks them into HDFS.
On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <[EMAIL PROTECTED]> wrote:
I have a fair bit of data continually being created in the form of smallish messages (a few hundred bytes), which needs to enter flume, and eventually sink into HDFS.
I need to be sure that the data lands in persistent storage and won't be lost, but otherwise throughput isn't important. It just needs to be fast enough to not back up.
I'm running into a bottleneck in the initial ingestion of data.
I've tried the netcat source and the thrift source, but both have capped out at a thousand or so records per second.
Batching up the thrift api items into sets of 10 and using appendBatch is a pretty large speedup, but still not enough.
Here's a gist of my ruby test script, and some example runs, and my config.
1. Are there any obvious performance changes I can make to speed up ingestion?
2. How fast can flume reasonably go? Should I switch my source to be something else that's faster? What?
3. Is there a better tool for this kind of task (rapid, safe ingestion of small messages)?