I am also working on a similar requirement. One approach is to
mount your remote folder on your Hadoop master node
and then write a simple shell script to copy the files to HDFS.
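Something along these lines could work as the script (just a rough sketch;
/mnt/remote_json and /data/incoming are example paths I made up, adjust them
to your setup):

    #!/bin/bash
    # Sketch of the daily copy: assumes the remote folder is mounted at
    # /mnt/remote_json and that /data/incoming already exists in HDFS.

    SRC=/mnt/remote_json
    DEST=/data/incoming/$(date +%F)   # one dated HDFS directory per run

    # copy the whole mounted folder into HDFS as-is
    hadoop fs -put "$SRC" "$DEST"

You could then schedule the script with cron (for example
0 2 * * * /path/to/the/script.sh) so it runs at your fixed time every day.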
I believe Flume is the wrong choice here: Flume is a
data collection and aggregation framework, NOT a file transfer tool, so it
may NOT be a good fit when you actually want to copy the files as-is
onto your cluster (not 100% sure, as I am still working on that myself).
On Fri, Jan 25, 2013 at 6:39 AM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
> I am trying to copy JSON files from a remote folder (a folder on my local
> system, a Cloudfiles folder, or a folder on an S3 server) to the HDFS of a
> cluster running at a remote location.
> The job submitting Application is based on Spring Hadoop.
> Can someone please suggest the best option, or point me in the right
> direction, for achieving the above task:
> 1. Use Spring Integration data pipelines to poll the folders for files and
> copy them to HDFS as they arrive in the source folder. I have tried to
> implement the solution from the Spring Data book, but it does not run and I
> have no idea what is wrong, as it does not generate logs.
> 2. Use some other scripted method to transfer the files.
> Main requirement: I need to transfer files from a remote folder to HDFS
> every day at a fixed time for processing in the Hadoop cluster. These files
> are collected from various sources into the remote folders.
> Please suggest an efficient approach. I have been searching and have found
> a lot of approaches, but I am unable to decide what will work best, as this
> transfer needs to be as fast as possible.
> The files to be transferred will total almost 10 GB of JSON files, each file
> no more than 6 KB.
> Thanking You,
> Ouch Whisper