Flume, mail # user - Recommendation of parameters for better performance with File Channel


Re: Recommendation of parameters for better performance with File Channel
Juhani Connolly 2012-12-19, 09:23
Hi Jagadish,

You may want to check out the thread "Re: Flume 1.3.0 - NFS + File
Channel Performance".

It turns out the changes in FLUME-1609 affect FileChannel performance a
fair bit (even on normal, non-NFS file systems). We ran a version of 1.3
from an earlier trunk and took a big performance hit when we switched to
the 1.3 release. I isolated it to the FLUME-1609 patch. After building
and installing the 1.4 trunk, performance was back to normal.

On 12/18/2012 08:05 PM, Jagadish Bihani wrote:
> Hi
>
> Thanks for the inputs Hari and Brock.
> I tried a batch size of 10000, and throughput increased from 1.5 to
> 1.8 MB/sec. Then I used multiple HDFS sinks reading from the same
> channel and got around 2.3 MB/sec.
>
> Regards,
> Jagadish
>
>
>
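A rough sketch of the setup Jagadish describes above - two HDFS sinks
draining the same file channel, each with a larger batch size. The sink
names, the second path and the values are illustrative assumptions, not
his actual configuration:

agent.channels = fileChannel
agent.sinks = hdfsSink1 hdfsSink2

# Both sinks drain the same file channel, so two threads write to HDFS in parallel.
agent.sinks.hdfsSink1.type = hdfs
agent.sinks.hdfsSink1.channel = fileChannel
agent.sinks.hdfsSink1.hdfs.path = hdfs://mltest2001/flume/release3Test
agent.sinks.hdfsSink1.hdfs.fileType = DataStream
agent.sinks.hdfsSink1.hdfs.batchSize = 10000

agent.sinks.hdfsSink2.type = hdfs
agent.sinks.hdfsSink2.channel = fileChannel
agent.sinks.hdfsSink2.hdfs.path = hdfs://mltest2001/flume/release3Test2
agent.sinks.hdfsSink2.hdfs.fileType = DataStream
agent.sinks.hdfsSink2.hdfs.batchSize = 10000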
> On 12/13/2012 03:14 AM, Hari Shreedharan wrote:
>> Yep, each sink with a different prefix will work fine too. My
>> suggestion was just meant to avoid collision - file prefixes are good
>> enough for that.
>>
>> --
>> Hari Shreedharan
>>
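A minimal sketch of the prefix approach discussed here - both sinks
writing to the same HDFS directory, each with a unique hdfs.filePrefix.
The sink names and prefix values are illustrative assumptions:

# Same directory for both sinks; the per-sink prefix keeps file names from colliding.
agent.sinks.hdfsSink1.hdfs.path = hdfs://mltest2001/flume/release3Test
agent.sinks.hdfsSink1.hdfs.filePrefix = sink1
agent.sinks.hdfsSink2.hdfs.path = hdfs://mltest2001/flume/release3Test
agent.sinks.hdfsSink2.hdfs.filePrefix = sink2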
>> On Wednesday, December 12, 2012 at 1:13 PM, Bhaskar V. Karambelkar wrote:
>>
>>> Hari,
>>> If each sink uses a different file prefix, what's the need to write to
>>> multiple HDFS directories?
>>> All our sinks write to the same HDFS directory, each using a unique
>>> file prefix, and it seems to work fine.
>>> Also, I haven't found anything in the Flume code or HDFS APIs which
>>> suggests that two sinks can't write to the same directory.
>>>
>>> Just curious.
>>> thanks
>>>
>>>
>>> On Wed, Dec 12, 2012 at 12:53 PM, Hari Shreedharan
>>> <[EMAIL PROTECTED]> wrote:
>>>> Also note that having multiple sinks often improves performance -
>>>> though you should have each sink write to a different directory on
>>>> HDFS. Since each sink really uses only one thread at a time to write,
>>>> having multiple sinks allows multiple threads to write to HDFS. Also,
>>>> if you can spare additional disks on your Flume agent machine for
>>>> file channel data directories, that will also improve performance.
>>>>
>>>>
>>>>
>>>> Hari
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>
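A hedged sketch of the point above about sparing additional disks: the
file channel's checkpointDir and dataDirs properties can be pointed at
directories on separate physical disks. The mount points below are
assumptions:

agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /disk1/flume/checkpoint
# dataDirs takes a comma-separated list; one entry per spare disk spreads the write load.
agent.channels.fileChannel.dataDirs = /disk2/flume/data,/disk3/flume/data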
>>>> On Wednesday, December 12, 2012 at 7:36 AM, Brock Noland wrote:
>>>>
>>>> Hi,
>>>>
>>>> Why not try increasing the batch size on the source and sink to 10,000?
>>>>
>>>> Brock
>>>>
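A minimal sketch of Brock's suggestion, assuming the spooldir source,
HDFS sink and file channel named in the configuration further down. The
values are illustrative; the file channel's transactionCapacity needs to
be at least as large as the largest batch size:

agent.sources.spooler.batchSize = 10000
agent.sinks.hdfsSink.hdfs.batchSize = 10000
# Transactions must be able to hold a full batch.
agent.channels.fileChannel.transactionCapacity = 10000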
>>>> On Wed, Dec 12, 2012 at 4:08 AM, Jagadish Bihani
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>
>>>> I am using the latest release of Flume (1.3.0) and Hadoop 1.0.3.
>>>>
>>>>
>>>> On 12/12/2012 03:35 PM, Jagadish Bihani wrote:
>>>>
>>>>
>>>> Hi
>>>>
>>>> I am able to write a maximum of 1.5 MB/sec of data to HDFS (without
>>>> compression) using the File Channel. Are there any recommendations to
>>>> improve the performance?
>>>> Has anybody achieved around 10 MB/sec with the file channel? If yes,
>>>> please share the configuration (hardware used, RAM allocated, and
>>>> batch sizes of source, sink and channels).
>>>>
>>>> Following are the configuration details :
>>>> =======================
>>>> I am using a machine with a reasonable hardware configuration:
>>>> a quad-core 2.00 GHz processor and 4 GB RAM.
>>>>
>>>> Command line options passed to flume agent :
>>>> -DJAVA_OPTS="-Xms1g -Xmx4g -Dcom.sun.management.jmxremote
>>>> -XX:MaxDirectMemorySize=2g"
>>>>
>>>> Agent Configuration:
>>>> ============
>>>> agent.sources = avro-collection-source spooler
>>>> agent.channels = fileChannel
>>>> agent.sinks = hdfsSink fileSink
>>>>
>>>> # For each one of the sources, the type is defined
>>>>
>>>> agent.sources.spooler.type = spooldir
>>>> agent.sources.spooler.spoolDir =/root/test_data
>>>> agent.sources.spooler.batchSize = 1000
>>>> agent.sources.spooler.channels = fileChannel
>>>>
>>>> # Each sink's type must be defined
>>>> agent.sinks.hdfsSink.type = hdfs
>>>> agent.sinks.hdfsSink.hdfs.path=hdfs://mltest2001/flume/release3Test
>>>>
>>>> agent.sinks.hdfsSink.hdfs.fileType =DataStream