Hadoop >> mail # user >> Loading Data to HDFS


Loading Data to HDFS
Hi,

I have data on a remote machine accessible over SSH. I have Hadoop CDH4 installed on RHEL, and I am planning to load quite a few petabytes of data onto HDFS.
 
Which will be the fastest method to use, and are there any projects around Hadoop that can be used as well?

 
I cannot install the Hadoop client on the remote machine.
 
Have a great Day Ahead!
Sumit.
 
 
---------------
Here I am attaching my previous discussion on CDH-user to avoid duplication.
---------------
On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
In addition to Jarcec's suggestions, you could use HttpFS. Then you'd only need to poke a single host:port in your firewall, as all the traffic goes through it.
thx
Alejandro
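As a concrete sketch of Alejandro's HttpFS route: HttpFS exposes the same REST API as WebHDFS, so a file can be pushed with nothing but curl from any machine that can reach the HttpFS host. The host name, user, and paths below are placeholders; port 14000 is the HttpFS default in CDH4, and `data=true` is the flag HttpFS uses to accept the file body directly rather than via WebHDFS's usual two-step redirect.

```shell
# All names here are placeholders -- substitute your own cluster details.
HTTPFS_HOST="httpfs.example.com"   # the single host:port opened in the firewall
HTTPFS_PORT=14000                  # HttpFS default port in CDH4
HDFS_USER="sumit"
DEST="/user/sumit/data/file1.dat"

# HttpFS speaks the WebHDFS REST API; data=true lets it take the file body
# in one request instead of WebHDFS's two-step redirect.
URL="http://${HTTPFS_HOST}:${HTTPFS_PORT}/webhdfs/v1${DEST}?op=CREATE&user.name=${HDFS_USER}&data=true"
echo "$URL"

# The actual upload (needs a reachable HttpFS server, so shown commented out):
# curl -X PUT -T file1.dat -H "Content-Type: application/octet-stream" "$URL"
```

For petabyte-scale transfers you would run many such uploads in parallel, one per file, since each request is a single HTTP stream through the HttpFS host.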

On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:
> Hi Sumit,
> there are plenty of ways to achieve that. Please find my feedback below:
>
>> Does Sqoop support loading flat files to HDFS?
>
> No, Sqoop supports only moving data from external database and warehouse systems. Copying flat files is not supported at the moment.
>
>> Can use distcp?
>
> No. DistCp can be used only to copy data between HDFS filesystems.
>
>> How do we use the core-site.xml file on the remote machine to use
>> copyFromLocal?
>
> Yes, you can install the Hadoop binaries on your machine (with no Hadoop services running) and use the hadoop binary to upload data. The installation procedure is described in the CDH4 installation guide [1] (follow the "client" installation).
>
> Another way that I can think of is leveraging WebHDFS [2] or maybe hdfs-fuse [3]?
>
> Jarcec
>
> Links:
> 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
> 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS
> 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS
>
> On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:
>>
>>
>> Hi,
>>
>> I have data on a remote machine accessible over ssh. What is the fastest
>> way to load data onto HDFS?
>>
>> Does Sqoop support loading flat files to HDFS?
>> Can use distcp?
>> How do we use the core-site.xml file on the remote machine to use
>> copyFromLocal?
>>
>> Which will be the best to use and are there any other open source projects
>> around Hadoop which can be used as well?
>> Have a great Day Ahead!
>> Sumit
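For reference, Jarcec's client-only install route [1] boils down to something like the following, assuming the CDH4 client packages are installed on the edge machine and its core-site.xml points at the cluster's NameNode (host names and paths below are placeholders):

```shell
# Client-only CDH4 install: hadoop binaries present, no daemons running.
# Assumes core-site.xml sets the default filesystem to the cluster's NameNode.
hadoop fs -mkdir -p /user/sumit/incoming
hadoop fs -copyFromLocal /data/exports/part-*.dat /user/sumit/incoming/

# Without editing core-site.xml, the NameNode URI can be passed per command
# via the generic -fs option:
hadoop fs -fs hdfs://namenode.example.com:8020 -put /data/exports/part-0.dat /user/sumit/incoming/
```

A single copyFromLocal stream is bound by one machine's network interface, so for petabyte-scale loads the usual approach is to run several copies in parallel, one per source disk or file shard. The hdfs-fuse route [3] instead mounts HDFS as a local filesystem (in CDH, something like `hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs`), after which ordinary `cp` or `rsync` work, though typically more slowly than the native client.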