Hadoop, mail # user - Loading Data to HDFS


Re: Loading Data to HDFS
sumit ghosh 2012-10-30, 13:25
Hi Bertrand,

A gateway machine is one that is typically used to connect to the Hadoop cluster; however, the machine itself does not run a DataNode or TaskTracker.
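As a sketch, loading through such a gateway usually means staging the data on the gateway's local disk and then pushing it into HDFS with the client tools; the hosts and paths below are hypothetical:

```shell
# Hypothetical hosts and paths. The gateway carries the Hadoop client
# configuration (core-site.xml etc.) but runs no DataNode/TaskTracker.
scp user@remote-host:/data/part-0001.csv /staging/part-0001.csv

# Push the staged copy into HDFS from the gateway.
hadoop fs -mkdir -p /user/sumit/incoming
hadoop fs -copyFromLocal /staging/part-0001.csv /user/sumit/incoming/
hadoop fs -ls /user/sumit/incoming
```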
 
Warm Regards,
Sumit
________________________________
From: Bertrand Dechoux <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; sumit ghosh <[EMAIL PROTECTED]>
Sent: Tuesday, 30 October 2012 4:40 PM
Subject: Re: Loading Data to HDFS
I don't know what you mean by gateway, but in order to have a rough idea of the time needed you need 3 values:
* amount of data you want to put on hadoop
* hadoop bandwidth with regards to local storage (read/write)
* bandwidth between where your data are stored and where the hadoop cluster is

For the latter, for big volumes, physically moving the volumes is a viable solution.
It will depend on your constraints of course: budget, speed...
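Plugging hypothetical numbers into the three values above gives a feel for the scale; assuming 1 PB to move over a dedicated, fully utilised 10 Gbit/s link:

```shell
# Back-of-envelope transfer-time estimate; both figures are hypothetical.
DATA_BYTES=$(( 1024 ** 5 ))             # 1 PB of data to move
LINK_BPS=$(( 10 * 1000 ** 3 ))          # 10 Gbit/s sustained link
SECS=$(( DATA_BYTES * 8 / LINK_BPS ))   # seconds at full line rate
DAYS=$(( SECS / 86400 ))
echo "~${DAYS} days at full line rate"  # → ~10 days at full line rate
```

Over ten days of saturated transfer per petabyte is exactly the regime where shipping disks becomes competitive.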

Bertrand
On Tue, Oct 30, 2012 at 11:39 AM, sumit ghosh <[EMAIL PROTECTED]> wrote:

Hi Bertrand,

>By physically moving the data, do you mean that the data volume is connected to the gateway machine and the data is loaded from the local copy using copyFromLocal?

>Thanks,
>Sumit
>
>
>
>________________________________
>From: Bertrand Dechoux <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]; sumit ghosh <[EMAIL PROTECTED]>
>Sent: Tuesday, 30 October 2012 3:46 PM
>Subject: Re: Loading Data to HDFS
>
>
>It might sound like a deprecated way, but can't you move the data physically?
>From what I understand, it is one shot and not "streaming", so it could be a
>good method if you have the access, of course.
>
>Regards
>
>Bertrand
>
>On Tue, Oct 30, 2012 at 11:07 AM, sumit ghosh <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have data on a remote machine accessible over ssh. I have Hadoop CDH4
>> installed on RHEL. I am planning to load quite a few petabytes of data onto
>> HDFS.
>>
>> Which will be the fastest method to use, and are there any projects around
>> Hadoop which can be used as well?
>>
>>
>> I cannot install Hadoop-Client on the remote machine.
>>
>> Have a great Day Ahead!
>> Sumit.
>>
>>
>> ---------------
>> Here I am attaching my previous discussion on CDH-user to avoid
>> duplication.
>> ---------------
>> On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur <[EMAIL PROTECTED]>
>> wrote:
>> in addition to jarcec's suggestions, you could use httpfs. then you'd only
>> need to poke a single host:port in your firewall as all the traffic goes
>> thru it.
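A sketch of what that single-endpoint upload looks like with curl, assuming HttpFS listens on its default port 14000 on a host called httpfs-host and that simple (user.name) authentication is in use; host, path, and user are hypothetical:

```shell
# HttpFS proxies the WebHDFS REST API through one host:port, so only
# this endpoint needs to be opened in the firewall. The data=true
# parameter and octet-stream content type tell HttpFS the request
# body is the file content.
curl -X PUT -T part-0001.csv \
  -H "Content-Type: application/octet-stream" \
  "http://httpfs-host:14000/webhdfs/v1/user/sumit/part-0001.csv?op=CREATE&user.name=sumit&data=true"
```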
>> thx
>> Alejandro
>>
>> On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>
>> wrote:
>> > Hi Sumit,
>> > there are plenty of ways to achieve that. Please find my feedback
>> below:
>> >
>> >> Does Sqoop support loading flat files to HDFS?
>> >
>> > No, Sqoop supports only moving data from external database and
>> warehouse systems. Copying plain files is not supported at the moment.
>> >
>> >> Can use distcp?
>> >
>> > No. Distcp can be used only to copy data between HDFS filesystems.
>> >
>> >> How do we use the core-site.xml file on the remote machine to use
>> >> copyFromLocal?
>> >
>> > Yes, you can install the Hadoop binaries on your machine (with no Hadoop
>> services running) and use the hadoop binary to upload data. The installation
>> procedure is described in the CDH4 installation guide [1] (follow the
>> "client" installation).
>> >
>> > Another way that I can think of is leveraging WebHDFS [2] or maybe
>> hdfs-fuse [3]?
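For WebHDFS the upload is a two-step REST exchange; a sketch with curl, assuming WebHDFS is enabled [2] and the NameNode HTTP port is the CDH4-era default 50070 (hosts and paths are hypothetical):

```shell
# Step 1: ask the NameNode to create the file; it answers with an
# HTTP 307 redirect whose Location header names a DataNode.
curl -i -X PUT \
  "http://namenode-host:50070/webhdfs/v1/user/sumit/data.csv?op=CREATE&user.name=sumit"

# Step 2: send the file body to the URL from that Location header
# (datanode-host:50075 below is a placeholder for it).
curl -i -X PUT -T data.csv \
  "http://datanode-host:50075/webhdfs/v1/user/sumit/data.csv?op=CREATE&user.name=sumit"
```

Because step 2 talks directly to a DataNode, plain WebHDFS needs every DataNode reachable from the client, which is the hole HttpFS avoids.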
>> >
>> > Jarcec
>> >
>> > Links:
>> > 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
>> > 2:
>> https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS
>> > 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS
>> >
>> > On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:
>> >>
>> >>
>> >> Hi,
>> >>
>> >> I have data on a remote machine accessible over ssh. What is the fastest
>> >> way to load data onto HDFS?
>> >>
>> >> Does Sqoop support loading flat files to HDFS?
>> >> Can use distcp?
>> >> How do we use the core-site.xml file on the remote machine to use
>> >> copyFromLocal?
Bertrand Dechoux