Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Flume >> mail # user >> File Channel Best Practice

Copy link to this message
RE: File Channel Best Practice
We co-locate our flume agents on our data nodes in order to have access to many 'spindles' for the file channels. We have a small cluster (10 nodes) so these are also our task tracker nodes and we haven't seen any huge performance issues.

For reference, our typical event ingestion rate is between 2k and 5k events per second under 'normal' production load. I recently had to backfill a couple of weeks of web logs though, and took the opportunity to examine max throughput rates and how heavy MR load affected things. During that 'test' we stabilized at about 12K events per second written to HDFS, that was a single agent using 2 HDFS sinks taking from one file channel. As far as I could tell my bottleneck was in the Avro hop between my collector and writer agents, not in the HDFS sinks. When we had all MR slots used by large batch jobs for extended amounts of time the event throughput degraded to about 3500 events/sec.

I know these are just anecdotal data points, but wanted to share my experience with flume agents located on the actual data/task nodes themselves. I have done very little optimization aside from separating the file channel data/log directories onto separate drives.

-Paul Chavez

From: Devin Suiter RDX [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 17, 2013 8:30 AM
Subject: File Channel Best Practice


There has been a lot of discussion about file channel speed today, and I have had a dilemma I was hoping for some feedback on, since the topic is hot.

Regarding this:

1) You are only using a single disk for file channel and it looks like a single disk for both checkpoint and data directories therefore throughput is going to be extremely slow."

How do you solve in a practical sense the requirement for file channel to have a range of disks for best R/W speed, yet still have network visibility to source data sources and the Hadoop cluster at the same time?

It seems like for production file channel implementation, the best solution is to give Flume a dedicated server somewhere near the edge with a JBOD pile properly mounted and partitioned. But that adds to implementation cost.

The alternative seems to be to run Flume on a  physical Cloudera Manager SCM server that has some extra disks, or run Flume agents concurrent with datanode processes on worker nodes, but those don't seem good to do, especially piggybacking on worker nodes, and file channel > HDFS will compound the issue...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host and a fast SANS/NAS connection and get by ok, but that will merge your parallelization at some point...

Any ideas on the subject?

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com<http://www.rdx.com/>