I have a fair bit of data continually being created in the form of smallish messages (a few hundred bytes), which needs to enter Flume and eventually sink into HDFS.
I need to be sure that the data lands in persistent storage and won't be lost, but otherwise throughput isn't important. It just needs to be fast enough to not back up.
I'm running into a bottleneck in the initial ingestion of data.
I've tried the netcat source and the Thrift source, but both have capped out at a thousand or so records per second.
Batching the Thrift API events into sets of 10 and using appendBatch gives a pretty large speedup, but still not enough.
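To be concrete, the batching amounts to roughly the following (sketched here against Flume's Java RpcClient rather than my actual Ruby script; the host, port, and message contents are placeholders):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class BatchedThriftSender {
  public static void main(String[] args) throws EventDeliveryException {
    // Thrift RPC client pointed at the agent's Thrift source (host/port are placeholders)
    RpcClient client = RpcClientFactory.getThriftInstance("localhost", 4353);
    try {
      List<Event> batch = new ArrayList<>();
      for (int i = 0; i < 10; i++) {  // batches of 10, as in my test script
        batch.add(EventBuilder.withBody(("message " + i).getBytes(StandardCharsets.UTF_8)));
      }
      // one round trip (and one channel transaction) for the whole batch
      client.appendBatch(batch);
    } finally {
      client.close();
    }
  }
}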
Here's a gist of my Ruby test script, some example runs, and my config:
https://gist.github.com/cschneid/9792305

1. Are there any obvious performance changes I can make to speed up ingestion?
2. How fast can Flume reasonably go? Should I switch my source to something faster? If so, what?
3. Is there a better tool for this kind of task (rapid, safe ingestion of small messages)?
You could have two agents that read the small messages and sink to HDFS, or two agents that read the messages, serialize them, and send them to a third agent which sinks them into HDFS.

On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <[EMAIL PROTECTED]> wrote:
My specific situation is a bit more complex than I let on initially.
Flume running multiple agents will absolutely be able to scale to the size we need for production. But since our system is time-based, waiting for real-world measurements to arrive, we have a simulation layer that generates convincing, realistic data to push in for development & demos (i.e., it creates events at 1000x accelerated time, so we can see the effects of our changes without waiting weeks).
So we have a VM (Vagrant + VirtualBox) running HDFS & Flume on our laptops as we're doing development. I suppose a memory channel is fine in this case, since it's all test data, but maximum single-agent speed is needed to support the higher time accelerations I want.
Unfortunately, our production system demands horizontal scaling (where Flume is great), while our dev environment would be best served by vertical scaling (not as much Flume's goal, from what I can tell).
Are there any tricks/tweaks that can get single-agent speed up? What's the fastest (maybe not 100% safe?) source type? Can we minimize the cost of ACKing messages in the source?

On Thu, Mar 27, 2014 at 12:10 PM, Mike Keane <[EMAIL PROTECTED]> wrote:
I know I'm derailing a bit, but scaling Flume and HDFS in a single VM is ... well, I guess I understand why, but is it a good approach to try to squeeze every bit out of a virtual machine sitting on your laptop, especially for Hadoop/Flume?
Could you stand up a small cluster, e.g. in AWS, if you really want to do high-volume perf testing? That should be a very simple task with Whirr or CM or ...

On Thu, Mar 27, 2014 at 12:34 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
How much data are you ingesting on a per-minute or per-second basis? How many sources are we talking about here? What kind of channel are you using currently, and what is the memory/storage footprint on the source as well as the sink? Is the traffic uniformly distributed? If not, what is the peak throughput you expect from a given source?
On Thu, Mar 27, 2014 at 11:07 AM, Andrew Ehrlich <[EMAIL PROTECTED]> wrote:
It looks like you had acks turned on in the config you posted for your netcat source. You might want to try turning them off:
agent1.sources.netcatSource.ack-every-event = false

We've gotten up to around 1,400 events per second on a single netcat source feeding two HDFS sinks without any issues (using a memory channel). This is on a live network, so we've never tested above that; that's the maximum throughput of the events we're storing.
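For reference, a trimmed-down sketch of that kind of setup (names, ports, and paths here are placeholders, not our actual config):

agent1.sources = netcatSource
agent1.channels = memChannel
agent1.sinks = hdfsSink1 hdfsSink2

agent1.sources.netcatSource.type = netcat
agent1.sources.netcatSource.bind = 0.0.0.0
agent1.sources.netcatSource.port = 44444
agent1.sources.netcatSource.ack-every-event = false
agent1.sources.netcatSource.channels = memChannel

agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100000
agent1.channels.memChannel.transactionCapacity = 1000

agent1.sinks.hdfsSink1.type = hdfs
agent1.sinks.hdfsSink1.channel = memChannel
agent1.sinks.hdfsSink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfsSink1.hdfs.filePrefix = sink1
agent1.sinks.hdfsSink1.hdfs.batchSize = 1000
# hdfsSink2 is configured the same way, with a different filePrefix so the two sinks don't collide on file names

Both sinks drain the same memory channel, which is what lets them write to HDFS in parallel.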
Ed

On Fri, Mar 28, 2014 at 5:58 AM, Asim Zafir <[EMAIL PROTECTED]> wrote:
Thanks for the netcat ack setting - I'll give that a shot.
As for data ingestion, we're aiming for ~10,000 events per second on our development VMs, since we want to be able to run the system in time acceleration (with simulated incoming data). Each event is ~100-300 bytes, so that's only about 1-3 MB/s of raw payload.
I've gotten massive speedups with the Thrift source -> memory channel -> HDFS pipeline once I started sending the Thrift source larger and larger batches. Since the consistency requirements are much lower during development and demos, it's easy to raise that batch size arbitrarily high, so I think that solution will work for now.
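In case it helps anyone else, the agent-side knobs that seemed to matter once the client batches got big are roughly these (values are illustrative placeholders, not tuned numbers):

agent1.sources.thriftSource.type = thrift
agent1.sources.thriftSource.bind = 0.0.0.0
agent1.sources.thriftSource.port = 4353
agent1.sources.thriftSource.channels = memChannel

agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100000
# must be at least as large as the biggest batch the client sends in one appendBatch call
agent1.channels.memChannel.transactionCapacity = 5000

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = memChannel
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
# bigger sink batches mean fewer HDFS flushes per event
agent1.sinks.hdfsSink.hdfs.batchSize = 5000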
Thanks, everybody, for the help thinking through this and for clarifying whether what I'm seeing is reasonable. I think a batched Thrift source will sustain me for now, so I can move on with my project and loop back when I have better numbers for what I really need on my VM.

On Fri, Mar 28, 2014 at 3:23 AM, ed <[EMAIL PROTECTED]> wrote: