I co-authored a paper about this that was published at the NASA/IEEE Mass Storage conference in 2006. Also, my Ph.D. dissertation contains information about making these types of data movement selections when needed. Thought I'd throw it out there in case it helps.
On 3/2/10 11:10 AM, "jiang licht" <[EMAIL PROTECTED]> wrote:
Thanks a lot for sharing your experience. Here I have some questions to bother you for more help :)
So, basically, data transfer in your case is a two-step job: first, use GridFTP to make a local copy of the data on the target; second, load the data into the target cluster with something like "hadoop fs -put". If this is correct, I am wondering whether it will consume too much disk space on your target box (since the data is stored in a local file system before being distributed to the Hadoop cluster). Also, do you do an integrity check for each file transferred (one straightforward method might be a 'cksum' or similar comparison, but is that feasible in terms of efficiency)?
I am not familiar with GridFTP, except that I know it is a better choice than scp, sftp, etc. because it can tune TCP settings and run parallel transfers. So, I want to know: does it keep a log of which files have been transferred successfully and which have not, and does GridFTP do a file integrity check? Right now, I have only one box for data storage (not in the Hadoop cluster) and want to transfer that data to Hadoop. Can I just install GridFTP on this box and on the name node box to enable GridFTP transfer from the first to the second?
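To make the integrity-check question above concrete, here is a minimal sketch in Python of a per-file digest comparison. It is only an illustration under stated assumptions: the function names `file_digest` and `find_mismatches` are hypothetical, it uses SHA-256 from `hashlib` rather than the 'cksum' CRC mentioned above, and it assumes the target-side digests are computed over the same bytes (for example by reading each file back with "hadoop fs -cat" and hashing the stream). It says nothing about what GridFTP itself checks.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so a large file never sits fully in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(source_digests, target_digests):
    """Return the names whose digest is missing or different on the target side.

    Both arguments are hypothetical {file name: hex digest} manifests.
    """
    return sorted(
        name
        for name, digest in source_digests.items()
        if target_digests.get(name) != digest
    )
```

The cost is one full read of every file on each side, so on large data sets the check is likely to be bounded by disk and network throughput rather than by the hash itself.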
--- On Tue, 3/2/10, Brian Bockelman <[EMAIL PROTECTED]> wrote:
From: Brian Bockelman <[EMAIL PROTECTED]>
Subject: Re: bulk data transfer to HDFS remotely (e.g. via wan)
To: [EMAIL PROTECTED]
Date: Tuesday, March 2, 2010, 8:38 AM
distcp runs a MapReduce job to transfer data between two clusters - but it might not be acceptable security-wise for your setup.
Locally, we use GridFTP between two clusters (not necessarily Hadoop!) and a protocol called SRM to load-balance between GridFTP servers. GridFTP was selected because it is common in our field, and we already have the certificate infrastructure well set up.
GridFTP is fast too - many Gbps is not too hard.
On Mar 2, 2010, at 1:30 AM, jiang licht wrote:
> I am considering a basic task of loading data into a Hadoop cluster in this scenario: the Hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.
> An example of this is moving data from Amazon S3 to EC2, which is supported in the latest Hadoop by specifying s3(n)://authority/path in distcp.
> But generally speaking, what is the best way to load data into a Hadoop cluster from a remote box? Clearly, in this scenario, it is unreasonable to copy the data to the local name node and then issue a command like "hadoop fs -copyFromLocal" to put the data in the cluster (besides this, the choice of data transfer tool is also a factor: scp or sftp, gridftp, ..., compression and encryption, ...).
> I am not aware of generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to Hadoop):
> cat datafile | ssh hadoopbox 'hadoop fs -put - dst'
> There are pros (it is simple and does the job without first storing a local copy of each data file and then running a command like 'hadoop fs -copyFromLocal') and cons (many such pipelines will obviously need to run in parallel to speed up the job, at the cost of creating processes on the remote machines to read data and maintain ssh connections; so if the data files are small, it is better to archive them into a tar file before calling 'cat'). As an alternative to 'cat', a program could be written that keeps reading data files and dumps them to stdin in parallel.
> Any comments about this or thoughts about a better solution?
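The "program instead of 'cat'" idea in the quoted message could look roughly like the Python sketch below. It is a sketch under assumptions, not a tested transfer tool: the function names, the `workers` parameter, and the injectable `run` argument are hypothetical, and it assumes passwordless ssh to the Hadoop box with `hadoop` on the remote PATH. Each worker thread runs one pipeline equivalent to `cat datafile | ssh hadoopbox 'hadoop fs -put - dst'`.

```python
import os
import posixpath
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def remote_put_command(host, dst):
    """Build the argv for: ssh <host> 'hadoop fs -put - <dst>'."""
    # shlex.quote protects destination paths that contain spaces or shell metacharacters.
    return ["ssh", host, "hadoop fs -put - " + shlex.quote(dst)]


def stream_file(path, host, dst, run=subprocess.run):
    """Stream one local file to HDFS over ssh and return the ssh exit code."""
    with open(path, "rb") as src:
        # The open file becomes the stdin of ssh, which replaces the 'cat' step.
        return run(remote_put_command(host, dst), stdin=src).returncode


def push_all(paths, host, dst_dir, workers=4, run=subprocess.run):
    """Run up to `workers` pipelines at once; return {local path: exit code}."""
    def one(path):
        dst = posixpath.join(dst_dir, os.path.basename(path))
        return path, stream_file(path, host, dst, run)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, paths))
```

A call such as `push_all(["part1.dat", "part2.dat"], "hadoopbox", "/user/data")` (hypothetical file names and destination) would return a non-zero exit code for any file whose pipeline failed, which gives a simple record of what needs to be retried. It still opens one ssh connection per file, so the advice about archiving small files first applies unchanged.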
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA