Jagadish Bihani
2012-10-09, 07:46
Brock Noland
2012-10-09, 14:31
Jagadish Bihani
2012-10-10, 10:11
Brock Noland
2012-10-10, 15:54
Jagadish Bihani
2012-10-10, 16:00
Brock Noland
2012-10-10, 16:05
Jagadish Bihani
2012-10-10, 16:22
Brock Noland
2012-10-10, 18:00
-
Flume throughput correlation with RAM
Jagadish Bihani 2012-10-09, 07:46

Hi

My Flume setup is:

Source Agent : cat source - File Channel - Avro Sink
Dest Agent : avro source - File Channel - HDFS Sink

There is only 1 source agent and 1 destination agent.

I measure throughput as the amount of data written to HDFS per second.
(I have a rolling interval of 30 sec, so if a 60 MB file is generated in 30 sec,
the throughput is 2 MB/sec.)

I have run the source agent on various machines with different hardware
configurations.
(In all cases I run the Flume agent with JAVA OPTIONS as
"-DJAVA_OPTS="-Xms500m -Xmx1g -Dcom.sun.management.jmxremote
-XX:MaxDirectMemorySize=2g")

JDK is 32 bit.

Experiment 1:
=====
RAM : 16 GB
Processor: Intel Xeon E5620 @ 2.40 GHz (16 cores).
64 bit Processor with 64 bit Kernel.
Throughput: 2 MB/sec

Experiment 2:
=====
RAM : 4 GB
Processor: Intel Xeon E5504 @ 2.00 GHz (4 cores).
64 bit Processor with 32 bit Kernel.
Throughput: 30 KB/sec

Experiment 3:
=====
RAM : 8 GB
Processor: Intel Xeon E5520 @ 2.27 GHz (16 cores).
64 bit Processor with 32 bit Kernel.
Throughput: 80 KB/sec

-- As can be seen, there is a huge difference in throughput with the same
configuration but different hardware.
-- In the first case, where throughput is higher, RES is around 160 MB; in the
other cases it is in the range of 40 MB - 50 MB.

Can anybody please give insights into why there is this huge difference in
throughput?
What is the correlation between RAM and file channel/HDFS sink performance,
and also with a 32-bit/64-bit kernel?

Regards,
Jagadish
-
Re: Flume throughput correlation with RAM
Brock Noland 2012-10-09, 14:31

Hi,

With the file channel, the number and type of disks is going to be much more
predictive of performance than CPU or RAM. Note that consumer-level
drives/controllers will give you much "better" performance because they lie to
you about when your data is actually written to the drive. If you search for
"fsync lies" you'll find more information on this.

You probably want to increase the batch size to get better performance.

Brock

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
-
Re: Flume throughput correlation with RAM
Jagadish Bihani 2012-10-10, 10:11

Hi

Thanks for the inputs, Brock. After several experiments, the problem
eventually boiled down to disks.

-- But I had used the same configuration on all 3 machines (so all software
components are the same).
-- The User Guide says that if multiple file channel instances are active on
the same agent, then different disks are preferable. But in my case *only one
file channel is active per agent.*
-- The only pattern I observed is that the machines where I got better
performance have multiple disks. But I don't understand how that helps if I
have only 1 active file channel.
-- What is the impact of the type of disk/disk device driver on performance?
I don't understand why I am getting 40 KB/sec with one disk and 2 MB/sec with
another.

Could you please elaborate on the correlation between the file channel and
disks?

Regards,
Jagadish
-
Re: Flume throughput correlation with RAM
Brock Noland 2012-10-10, 15:54

How big are your events? Average about 400 bytes?

Brock
-
Re: Flume throughput correlation with RAM
Jagadish Bihani 2012-10-10, 16:00

Hi

Yes. It is around 480 - 500 bytes.
-
Re: Flume throughput correlation with RAM
Brock Noland 2012-10-10, 16:05

OK, your disk that is giving you 40 KB/second is telling you the truth and the
faster disk is lying to you. Look up "fsync lies" to see what I am referring
to.

A spinning disk can do 100 fsync operations per second (an fsync is done at
the end of every batch). That is how I estimated your event size: 40 KB/second
spread over 100 fsyncs/second is 40 KB / 100 = 409 bytes per fsync, which
suggests each batch holds about one event.

Once again, if you want increased performance, you should increase the batch
size.

Brock
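
The arithmetic in the message above can be sketched as a small model. This is only a rough illustration, assuming that one fsync per batch is the sole bottleneck and that every event has the same size; the function names are made up for this sketch and are not part of Flume.

```python
def implied_bytes_per_fsync(throughput_bytes_per_sec, fsyncs_per_sec):
    """Bytes that must be committed per fsync to reach the observed throughput."""
    return throughput_bytes_per_sec / fsyncs_per_sec


def estimated_throughput(fsyncs_per_sec, batch_size, event_bytes):
    """Upper-bound bytes/second if each batch of events costs exactly one fsync.

    Assumption: the disk's fsync rate is the only limit; CPU, network and
    sequential write bandwidth are ignored.
    """
    return fsyncs_per_sec * batch_size * event_bytes


if __name__ == "__main__":
    # 40 KB/second at 100 fsyncs/second, as in the message above.
    print(implied_bytes_per_fsync(40 * 1024, 100))  # 409.6

    # 500-byte events (the size reported in the thread) at 100 fsyncs/second,
    # for a few hypothetical batch sizes.
    for batch_size in (1, 10, 100):
        rate = estimated_throughput(100, batch_size, 500)
        print(batch_size, round(rate / 1024, 1), "KB/s")
```

Under these assumptions, throughput grows linearly with batch size until some other limit takes over, which is why increasing the batch size is the suggested fix.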
-
Re: Flume throughput correlation with RAM
Jagadish Bihani 2012-10-10, 16:22

Hi Brock

I will surely look into 'fsync lies'.

But as per my experiments, I think the "file channel" is causing the issue.
On those 2 machines (one with higher throughput and the other with lower) I
ran the following experiment:

cat Source - memory channel - file sink

With this setup I got the same throughput on both machines (around 3 MB/sec).
Since I used the "File sink", it should also do "fsync" at some point.
'File Sink' and 'File Channel' both do disk writes, so if there are
differences in disk behaviour, they should be visible with the 'File Sink'
too.

Am I missing something here?

Regards,
Jagadish
-
Re: Flume throughput correlation with RAM
Brock Noland 2012-10-10, 18:00

Hi,

On Wed, Oct 10, 2012 at 11:22 AM, Jagadish Bihani wrote:
> Now as I have used "File sink" it should also do "fsync" at some point of
> time.
> 'File Sink' and 'File Channel' both do disk writes.
> So if there is differences in disk behaviour then even in the 'File Sink' it
> should be visible.
>
> Am I missing something here?

File sink does not call fsync.
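
The difference between a writer that calls fsync and one that does not can be checked directly on each machine. The following is a minimal sketch, not Flume code: it writes fixed-size events to a file and optionally calls fsync after every batch. The event size of 500 bytes comes from the thread; the event count and batch sizes are arbitrary assumptions.

```python
import os
import tempfile
import time


def write_events(path, n_events, event_bytes, batch_size, sync):
    """Write n_events fixed-size events to path and return bytes/second.

    When sync is True, flush and fsync after every batch_size events.
    When sync is False, never call fsync and leave write-back to the OS.
    """
    payload = b"x" * event_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        for i in range(1, n_events + 1):
            f.write(payload)
            if sync and i % batch_size == 0:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to the device
    elapsed = max(time.perf_counter() - start, 1e-9)
    return n_events * event_bytes / elapsed


if __name__ == "__main__":
    # Pass dir=... to TemporaryDirectory to test a specific disk.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "events.bin")
        cases = [
            ("no fsync", 1, False),
            ("fsync, batch=1", 1, True),
            ("fsync, batch=100", 100, True),
        ]
        for label, batch_size, sync in cases:
            rate = write_events(path, 200, 500, batch_size, sync)
            print(f"{label:18s} {rate / 1024:10.1f} KB/s")
```

On a disk that honours fsync, one would expect the "fsync, batch=1" case to be far slower than the other two; if all three numbers are similar, the drive or controller is probably acknowledging the sync from a write cache.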