Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput


Himanish Kushary 2013-03-28, 03:54
Ted Dunning 2013-03-28, 07:45
David Parks 2013-03-28, 07:56
Himanish Kushary 2013-03-28, 10:51
David Parks 2013-03-29, 05:41

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
Thanks Dave.

I had already tried using the s3distcp jar, but got stuck on the error
below, which made me think that it is something specific to the Amazon
Hadoop distribution.

Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream

Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but
it is not present in the CDH4 (my local environment) Hadoop jars.

Could you suggest how I could get around this issue? One option could be
using the Amazon-specific jars, but then I would probably need to get all
of them (otherwise it could cause version mismatch errors against HDFS,
NoSuchMethodError and the like).

Appreciate your help regarding this.

- Himanish
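
A quick way to confirm the classpath gap described above is a one-off
probe like the sketch below; the class name is copied verbatim from the
stack trace, while the class name ClasspathCheck and the mention of
HADOOP_CLASSPATH are just illustrative assumptions, not something from
this thread.

    // Probe for the EMR-only class that S3DistCp needs at runtime.
    public class ClasspathCheck {
        public static void main(String[] args) {
            String cls = "org.apache.hadoop.fs.s3native."
                    + "ProgressableResettableBufferedFileInputStream";
            try {
                Class.forName(cls);
                System.out.println("Found: " + cls);
            } catch (ClassNotFoundException e) {
                // Expected on stock CDH4 until a jar providing the class
                // is added, e.g. via HADOOP_CLASSPATH.
                System.out.println("Missing: " + cls);
            }
        }
    }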

On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[EMAIL PROTECTED]> wrote:

> None of that complexity; they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://<region>.elasticmapreduce/libs/s3distcp/<latest>/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
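
For anyone wiring this up from a local cluster, a minimal sketch of
pulling that public jar down with the Hadoop FileSystem API might look
like the following. The <region> and <latest> path segments are the
placeholders from the URL above and must be filled in, and s3n access
normally needs AWS credentials configured (fs.s3n.awsAccessKeyId /
fs.s3n.awsSecretAccessKey); the class name is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copy the public s3distcp jar to the local filesystem so it can be
    // added to a job's classpath. Placeholders must be replaced first.
    public class FetchS3DistCpJar {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path src = new Path(
                "s3n://<region>.elasticmapreduce/libs/s3distcp/<latest>/s3distcp.jar");
            FileSystem fs = src.getFileSystem(conf);
            fs.copyToLocalFile(src, new Path("file:///tmp/s3distcp.jar"));
        }
    }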
>
> Doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool; it's a Tool itself, so it's actually designed to be run
> from the Hadoop command line normally.
>
>     ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>             "--src",        "/frugg/image-cache-stage2/",
>             "--srcPattern", ".*part.*",
>             "--dest",       "s3n://fruggmapreduce/results-" + env + "/"
>                                 + JobUtils.isoDate + "/output/itemtable/",
>             "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern": make sure you have that leading `.*`; that one
> threw me for a loop once.
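
A small illustration of that gotcha: srcPattern appears to be applied to
the full source path (which is consistent with David's warning), so a
pattern without the leading `.*` can never match. The path below is
hypothetical.

    public class SrcPatternDemo {
        public static void main(String[] args) {
            // String.matches() must match the whole string, mirroring a
            // pattern applied to the full source path.
            String path = "/frugg/image-cache-stage2/part-00042";
            System.out.println(path.matches("part.*"));   // false
            System.out.println(path.matches(".*part.*")); // true
        }
    }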
>
> Dave
>
> From: Himanish Kushary [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3?
> Wouldn't some kind of VPN be needed between the Amazon EMR instance and
> our on-premises Hadoop instance? Did you mean use the jar from Amazon on
> our local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[EMAIL PROTECTED]> wrote:
>
> Have you tried using s3distcp from Amazon? I have used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45
> minutes, well over the 10-minute timeout period you're running into a
> problem with.
>
> Dave
>
> From: Himanish Kushary [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: [EMAIL PROTECTED]
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3
> using the distcp utility. There are around 2,200 files distributed over
> 15 directories. The maximum individual file size is approximately 50 MB.
>
> The distcp MapReduce job keeps failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid

Thanks & Regards
Himanish
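
The 600 seconds in the quoted error is the stock MR1 task timeout
(mapred.task.timeout, in milliseconds): tasks that neither read, write,
nor report status within that window are killed. One common mitigation,
sketched below as an assumption rather than a fix confirmed in this
thread, is to raise the timeout before launching the distcp job.

    import org.apache.hadoop.conf.Configuration;

    // Raise the MR1 task timeout (default 600000 ms, i.e. the 600 s in
    // the error) so slow S3 uploads are not killed mid-transfer.
    public class TimeoutConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("mapred.task.timeout", 30 * 60 * 1000L); // 30 min
            System.out.println("mapred.task.timeout = "
                    + conf.getLong("mapred.task.timeout", -1));
        }
    }

From the command line the same property can be passed through the
generic options, e.g. -D mapred.task.timeout=1800000 ahead of the distcp
arguments.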

David Parks 2013-03-29, 14:34