Hadoop, mail # dev - Moving TB of data from NFS to HDFS

Re: Moving TB of data from NFS to HDFS
Rajiv Chittajallu 2012-01-25, 12:28
You will most likely hit NFS server limits well before you see any noticeable issues with HDFS.

Writes to a single file are sequential, so total throughput for your transfer depends on the number of files copied in parallel and the rate at which
files can be read from NFS. If the total data set is split across a reasonable number of files, of, say, 2G each, the upload rate can be matched to the NFS server limits.
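
To put rough numbers on that, here is a back-of-the-envelope sketch (a simplified model, not a measurement). It assumes each file is one sequential stream and that the aggregate rate is capped by what the NFS server can deliver. The function name and every number in it are made up for illustration.

```python
def estimated_transfer_hours(total_gb, streams, per_stream_mb_s, nfs_limit_mb_s):
    """Rough estimate of copy time when the NFS server caps the aggregate rate."""
    # Parallel streams add up until they reach the server's limit.
    aggregate_mb_s = min(streams * per_stream_mb_s, nfs_limit_mb_s)
    total_mb = total_gb * 1024
    return total_mb / aggregate_mb_s / 3600


# Illustrative numbers only: 1 TB, 40 MB/s per stream, server limit 200 MB/s.
for streams in (1, 4, 8):
    hours = estimated_transfer_hours(1024, streams, 40, 200)
    print(f"{streams} streams: about {hours:.2f} hours")
```

Under this model, adding streams beyond the point where they saturate the NFS server does not shorten the transfer.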

On a small cluster, mounting the filesystem via NFS and using distcp with the input path as file:///<path> would work.

Another option is to make your files available via HTTP and run a simple streaming job to parallelize the data pull.
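
A minimal sketch of what such a streaming mapper could look like, assuming the job input is a text file with one URL per line and that the HDFS target directory is passed as the only argument. The function names and the example paths are hypothetical. It relies on `hadoop fs -put - <dst>` reading the data from stdin.

```python
#!/usr/bin/env python3
import posixpath
import shutil
import subprocess
import sys
import urllib.request
from urllib.parse import urlparse


def hdfs_target(url, dest_dir):
    """Map a source URL to an HDFS path, keeping only the file name."""
    name = posixpath.basename(urlparse(url).path)
    return posixpath.join(dest_dir, name)


def pull(url, dest_dir):
    """Stream one URL into HDFS without staging it on local disk."""
    target = hdfs_target(url, dest_dir)
    with urllib.request.urlopen(url) as resp:
        # "-" as the source makes `hadoop fs -put` read from stdin.
        proc = subprocess.Popen(
            ["hadoop", "fs", "-put", "-", target], stdin=subprocess.PIPE
        )
        shutil.copyfileobj(resp, proc.stdin)
        proc.stdin.close()
        return proc.wait()


def main(lines, dest_dir):
    for line in lines:
        # Keep the last tab-separated field in case the input format
        # prefixes each line with a key.
        url = line.strip().split("\t")[-1]
        if url:
            exit_code = pull(url, dest_dir)
            # Emit url and exit code as a tab-separated record.
            print(f"{url}\t{exit_code}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.stdin, sys.argv[1])
```

How the URL list gets split across map tasks depends on the job's input format, so the degree of parallelism is set there, not in the script.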

It basically comes down to how you want to initiate the parallel copies.
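
If you just want to start the copies from one machine that has the NFS mount, a rough sketch could be a bounded pool that runs one `hadoop fs -put` per file. The helper names, the default of 4 workers, and the example paths are assumptions for illustration, and the worker count would need tuning against the NFS server.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def put_command(local_path, hdfs_dir):
    """Build the argv for copying one NFS-mounted file into an HDFS directory."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def parallel_put(local_paths, hdfs_dir, max_workers=4, run=subprocess.call):
    """Run one `hadoop fs -put` per file, at most max_workers at a time.

    Returns a list of (path, exit_code) pairs in input order.
    `run` is injectable so the logic can be tried without a cluster.
    """
    paths = list(local_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda p: run(put_command(p, hdfs_dir)), paths))
    return list(zip(paths, codes))
```

For example, `parallel_put(["/mnt/nfs/part-0001", "/mnt/nfs/part-0002"], "/data/incoming")` would run the two uploads concurrently and report each exit code. Keeping max_workers small avoids tying up the namenode handlers.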


On Jan 25, 2012, at 1:19, Ajit Ratnaparkhi <[EMAIL PROTECTED]> wrote:

> Hi Raj,
> If you have all the data on an NFS-mounted disk, meaning on a single machine,
> then your upload will be limited by network bandwidth. You can try running dfs
> -put in multiple parallel threads for distinct data sets; you might be able
> to utilise network bandwidth to its maximum (take care not to have too many
> threads, otherwise namenode handlers will be busy all the time, making dfs
> unresponsive). I don't see any other way to make it faster; making data
> upload faster requires the data source to be present at distributed locations,
> which is not true in this case.
> -Ajit
> On Wed, Jan 25, 2012 at 10:46 AM, Praveen Sripati
>>> If it is divided up into several files and you can mount your NFS
>>> directory on each of the datanodes.
>> Just curious, how will this help?
>> Praveen
>> On Wed, Jan 25, 2012 at 12:39 AM, Robert Evans <[EMAIL PROTECTED]>
>> wrote:
>>> If it is divided up into several files and you can mount your NFS
>>> directory on each of the datanodes, you could possibly use distcp to do it.
>>> I have never tried using distcp for this, but it should work.  Or you can
>>> write your own streaming Map/Reduce script that does more or less the same
>>> thing as distcp and will take as input the list of files to copy, and will
>>> do a hadoop fs -put for each file having it come from NFS.
>>> --Bobby Evans
>>> On 1/24/12 12:49 AM, "rajmca2002" <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>> I have TB of data in NFS that I need to move to HDFS. I have used the
>>> hadoop put command to do this, but it took hours to place
>>> the file in HDFS. Is there any good approach to moving large files to HDFS?
>>> Please reply asap.
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Moving-TB-of-data-from-NFS-to-HDFS-tp33193061p33193061.html
>>> Sent from the Hadoop core-dev mailing list archive at Nabble.com.