Flume, mail # user - File Channel  performance and fsync


Re: File Channel performance and fsync
Brock Noland 2012-10-22, 14:29
In this case, it's best to think about FileChannel as if it were a database.
Let's pretend we are going to insert 1 million rows. If we committed on
each row, would performance be "good"?  No, everyone knows that when you
are inserting rows in databases, you want to batch 100-1000 rows into a
single commit, if you want "good" performance. (Quoting good because it's
subjective based on the scenario, but in this case we mean lots of
MB/second).

Part of the reason behind this logic is that when a database does a commit,
it does an fsync operation to ensure that all data is written to disk and
that you will not lose data due to a subsequent power loss.

FileChannel behaves *exactly* the same. If your "batch" is only a single
event, file channel will:

write single event
fsync
write single event
fsync

As such, if you want "good" performance with FileChannel, you must increase
your batch size, just like a database. If you have a batchSize of say 100,
then FileChannel will:

write single event 0
write single event 1
...
write single event 99
fsync

This results in much "better" performance, because the cost of each fsync
is spread over 100 events instead of one. It's worth noting that ExecSource
in Flume 1.2 does not have a batchSize, and as such each event is written
and then committed. ExecSource in Flume 1.3, which we will release soon,
does have a configurable batchSize. If you want to try that out, you can
build it from the flume-1.3.0 branch.
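
The two write patterns described above can be sketched in C. This is only an
illustration of "write N events, then fsync once", not FileChannel's actual
implementation; the function name write_events, the event format, and the
file paths are assumptions made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: append `count` events to `path`, calling fsync()
 * after every `batch_size` events and once more for a trailing partial
 * batch. Returns the number of fsync() calls made, or -1 on error. */
long write_events(const char *path, long count, long batch_size)
{
    assert(batch_size > 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    char event[64];
    long syncs = 0;

    for (long i = 0; i < count; i++) {
        int len = snprintf(event, sizeof event, "event %ld\n", i);
        if (write(fd, event, (size_t)len) != len) {
            close(fd);
            return -1;
        }
        /* Commit point: force everything written so far to disk. */
        if ((i + 1) % batch_size == 0 || i + 1 == count) {
            if (fsync(fd) != 0) {
                close(fd);
                return -1;
            }
            syncs++;
        }
    }

    if (close(fd) != 0)
        return -1;
    return syncs;
}
```

With a batch size of 1, writing 1000 events costs 1000 fsync calls; with a
batch size of 100, the same 1000 events cost 10. Timing the two calls on a
given disk should give a rough idea of how much of the slowdown is due to
fsync alone.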

Brock

On Mon, Oct 22, 2012 at 8:59 AM, Brock Noland <[EMAIL PROTECTED]> wrote:

>  Which version? 1.2 or trunk?
>
> On Monday, October 22, 2012 at 8:18 AM, Jagadish Bihani wrote:
>
>  Hi
>
> This is the simple configuration with which I am getting lower
> performance.
> Even with a 2-tier architecture (cat source - avro sinks - avro source - HDFS
> sink)
> I get similar performance with the file channel.
>
> Configuration:
> ========
> adServerAgent.sources = avro-collection-source
> adServerAgent.channels = fileChannel
> adServerAgent.sinks = hdfsSink fileSink
>
> # For each one of the sources, the type is defined
> adServerAgent.sources.avro-collection-source.type=exec
> adServerAgent.sources.avro-collection-source.command = cat /home/hadoop/file.tsf
>
> # The channel can be defined as follows.
> adServerAgent.sources.avro-collection-source.channels = fileChannel
>
> #Define file sink
> adServerAgent.sinks.fileSink.type = file_roll
> adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
> adServerAgent.sinks.fileSink.channel = fileChannel
> adServerAgent.channels.fileChannel.type=file
>
> adServerAgent.channels.fileChannel.dataDirs=/home/hadoop/flume/channel/dataDir5
>
> adServerAgent.channels.fileChannel.checkpointDir=/home/hadoop/flume/channel/checkpointDir5
> adServerAgent.channels.fileChannel.maxFileSize=4000000000
>
> And it is run with :
> JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote
> -XX:MaxDirectMemorySize=2g
>
> Regards,
> Jagadish
>
> On 10/22/2012 05:42 PM, Brock Noland wrote:
>
> Hi,
>
>  I'll respond in more depth later, but it would help if you posted your
> configuration file and the version of flume you are using.
>
>  Brock
>
>  On Mon, Oct 22, 2012 at 6:48 AM, Jagadish Bihani <
> [EMAIL PROTECTED]> wrote:
>
>  Hi
>
> I am writing this as a follow-up to another thread, which discussed
> "fsync lies" and noted that only the file channel, not the file sink,
> uses fsync:
>
> -- I tested fsync performance on 2 machines using the following code.
> (On one machine I was getting very good throughput using the file channel;
> on the other it was almost 100 times slower, with almost the same
> hardware configuration.)
>
>
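> #include <fcntl.h>      /* open, O_* flags */
> #include <unistd.h>     /* lseek, SEEK_* constants */
>
> #define PAGESIZE 4096
>
> int main(int argc, char *argv[])
> {
>
>         char my_write_str[PAGESIZE];
>         char my_read_str[PAGESIZE];
>         char *read_filename= argv[1];
>         int readfd,writefd;
>
>         readfd = open(read_filename,O_RDONLY);
>         /* permission bits are given in octal */
>         writefd = open("written_file",O_WRONLY|O_CREAT,0777);
>         int len=lseek(readfd,0,SEEK_END);   /* file size in bytes */
>         lseek(readfd,0,SEEK_SET);           /* rewind to the start */
>         int iterations = len/PAGESIZE;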