Re: Use flume to copy data in local directory (hadoop server) into hdfs
DSuiter RDX 2013-10-24, 17:57
You might want to set up redundant/load-balancing channels and sinks, so
that if one sink is tied up, the operation can be attempted on another sink.
I am not very experienced with that arrangement yet, so I cannot guide you
very much, but I have seen it mentioned as a way to ensure delivery when
there is too much going on. The source does not need to change, since it
will replicate events to all of its channels automatically, and each sink
can read from its own channel.
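For reference, a load-balancing sink group is configured roughly along these
lines (a sketch only -- the agent name "agent" and the sink names are
assumptions, and each sink would still need its own channel and HDFS
settings):

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = hdfsSink1 hdfsSink2
agent.sinkgroups.g1.processor.type = load_balance
agent.sinkgroups.g1.processor.selector = round_robin
agent.sinkgroups.g1.processor.backoff = true

With backoff enabled, a sink that fails is temporarily blacklisted, so
events keep draining through whichever sink is free.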
I'm not certain that Flume is a good way to handle such a large file; it
seems that Flume is designed for many small files, which it can aggregate
and so on.
But if the file you are uploading already sits on the local filesystem,
can't you just use a cron entry to run "hadoop fs -put $FILE $HDFS/INPUT/PATH"?
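For example, a crontab entry along these lines (the paths and schedule are
only placeholders) would push the file into HDFS every night at 2am:

# m h dom mon dow  command
0 2 * * * /usr/bin/hadoop fs -put /local-dir/bigfile.txt /user/hadoop/input/

Since "hadoop fs -put" fails if the destination file already exists, you may
want to add a timestamp to the target name, or move the source file aside
after a successful copy.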
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 24, 2013 at 11:35 AM, ltcuong211 <[EMAIL PROTECTED]> wrote:
> Hi Jeff & JS,
> I tried using the spooling dir source & memory channel. It still takes ~4
> minutes to copy 1 GB of data into HDFS.
> By the way, thanks for suggesting spooling source. I think it is better
> than exec + cat in my case.
> Cuong LUU
> On 21/10/2013 22:50, Jeff Lord wrote:
> Have you tried using the spooling directory source?
> On Mon, Oct 21, 2013 at 3:25 AM, Cuong Luu <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> I need to copy data in a local directory (hadoop server) into hdfs
>> regularly and automatically. This is my flume config:
>> agent.sources = execSource
>> agent.channels = fileChannel
>> agent.sinks = hdfsSink
>> agent.sources.execSource.type = exec
>> agent.sources.execSource.shell = /bin/bash -c
>> agent.sources.execSource.command = for i in /local-dir/*; do cat $i; done
>> agent.sources.execSource.restart = true
>> agent.sources.execSource.restartThrottle = 3600000
>> agent.sources.execSource.batchSize = 100
>> agent.sinks.hdfsSink.hdfs.rollInterval = 0
>> agent.sinks.hdfsSink.hdfs.rollSize = 262144000
>> agent.sinks.hdfsSink.hdfs.rollCount = 0
>> agent.sinks.hdfsSink.batchsize = 100000
>> agent.channels.fileChannel.type = FILE
>> agent.channels.fileChannel.capacity = 100000
>> While the hadoop command takes ~30 seconds, Flume takes around 4 minutes
>> to copy a 1 GB text file into HDFS. I am worried that either my config is
>> not good, or that I shouldn't use Flume in this case.
>> What is your opinion?