

Himanish Kushary 2013-03-29, 14:57
David Parks 2013-03-31, 01:26
Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
I was able to transfer the data to S3 successfully with the workaround
mentioned earlier. I was also able to max out our available upload
bandwidth: I got around 10 MB/s on average from the cluster.
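
For anyone comparing numbers, the rates discussed in this thread can be checked with some quick arithmetic. This is only a rough sketch: it assumes binary units (1 TB = 1024 x 1024 MB), which is what reproduces the 583 MB/s figure quoted below, and the function names are made up for illustration. The 15-30 node range is Dave's own rough recollection.

```python
# Rough throughput arithmetic for the figures quoted in this thread.
# Assumption: binary units, i.e. 1 TB = 1024 * 1024 MB.
MB_PER_TB = 1024 * 1024


def throughput_mb_per_s(size_tb, minutes):
    """Average rate in MB/s when size_tb terabytes move in `minutes` minutes."""
    return size_tb * MB_PER_TB / (minutes * 60)


def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at a steady rate_mb_per_s."""
    return size_tb * MB_PER_TB / rate_mb_per_s / 3600


if __name__ == "__main__":
    cluster_rate = throughput_mb_per_s(1.5, 45)
    print(f"1.5 TB in 45 min: {cluster_rate:.0f} MB/s for the whole cluster")
    # Spread over the 15-30 nodes mentioned in the thread:
    for nodes in (15, 30):
        print(f"  per node with {nodes} nodes: {cluster_rate / nodes:.1f} MB/s")
    # Time to move the same 1.5 TB over a single bandwidth-limited uplink:
    for rate in (4, 10):
        print(f"1.5 TB at {rate} MB/s: {transfer_hours(1.5, rate):.1f} hours")
```

The point of the per-node line is that the aggregate rate of a cluster inside AWS is the sum of many parallel transfers, whereas an upload from an outside cluster is capped by the site's uplink no matter how many mappers run.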

I ran the s3distcp jobs with the default timeout and did not face any
issues.

Thanks all for the help.

Himanish
On Sat, Mar 30, 2013 at 9:26 PM, David Parks <[EMAIL PROTECTED]> wrote:

> 4-20 MB/sec are common transfer rates from S3 to **1** local AWS box. This
> was, of course, a cluster, and s3distcp is specifically designed to take
> advantage of the cluster, so it was a 45 minute job to transfer the 1.5 TB
> to the full cluster of (I forget how many servers I had at the time) maybe
> 15-30 m1.xlarge. The numbers are rough; I could be mistaken and it was 1 ½
> hours to do the transfer (but I recall 45 min). In either case the s3distcp
> job ran longer than the task timeout period, which was the real point I was
> focusing on.
>
> I seem to recall needing to re-package their jar as well, but for
> different reasons: they package in some other open source utilities and I
> had version conflicts, so you might want to watch for that.
>
> I’ve never seen this ProgressableResettableBufferedFileInputStream, so I
> can’t offer much more advice on that one.
>
> Good luck! Let us know how it turns out.
>
> Dave
>
> *From:* Himanish Kushary [mailto:[EMAIL PROTECTED]]
> *Sent:* Friday, March 29, 2013 9:57 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs
> for the 1.0.4 branch (I could not find the 1.0.3 APIs, so I used
> http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
> "ProgressableResettableBufferedFileInputStream" class. I am not sure how it
> is present in the hadoop-core.jar in Amazon EMR.
>
> In the meantime I have come up with a dirty workaround: extracting the
> class from the Amazon jar and packaging it into its own separate jar. I am
> now able to run s3distcp on my local CDH4 cluster using Amazon's jar and
> transfer from my local Hadoop to Amazon S3.
>
> But the real issue is the throughput. You mentioned that you had
> transferred 1.5 TB in 45 mins, which comes to around 583 MB/s. I am barely
> getting 4 MB/s upload speed! How did you get 100x the speed compared to
> me? Could you please share any settings/tweaks that you may have used
> to achieve this? Were you on some very specific high-bandwidth network?
> Was it between HDFS on EC2 and Amazon S3?
>
> Looking forward to hearing from you.
>
> Thanks
>
> Himanish
>
> On Fri, Mar 29, 2013 at 10:34 AM, David Parks <[EMAIL PROTECTED]>
> wrote:
>
> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
> it primarily with 1.0.3, which is what AWS uses, so I presume that's what
> it's tested on.
>
> Himanish Kushary <[EMAIL PROTECTED]> wrote:
>
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the error
> below, which made me think that this is something specific to the Amazon
> Hadoop distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it
> is not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue? One option could be
> using the Amazon-specific jars, but then I would probably need to get all
> the jars (otherwise it could cause version mismatch errors for HDFS, such
> as NoSuchMethodError).
>
> Appreciate your help regarding this.
>
> - Himanish
>
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[EMAIL PROTECTED]>
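
The workaround quoted above (pulling the missing class out of the Amazon jar and packaging it into its own jar) could be sketched roughly as follows. This is an illustration only, not the exact steps that were used: a jar is an ordinary zip archive, so Python's standard `zipfile` module is enough. The function name and the jar file names in the usage comment are hypothetical; only the class path comes from the NoClassDefFoundError in the thread.

```python
import zipfile

# Class path taken from the NoClassDefFoundError quoted in the thread,
# without the ".class" suffix so that any inner classes
# (e.g. "...InputStream$1.class") are matched as well.
CLASS_PREFIX = (
    "org/apache/hadoop/fs/s3native/"
    "ProgressableResettableBufferedFileInputStream"
)


def repackage_entries(source_jar, target_jar, prefix=CLASS_PREFIX):
    """Copy every file entry whose name starts with `prefix` from
    source_jar into a newly created target_jar; return the copied names."""
    copied = []
    with zipfile.ZipFile(source_jar) as src, \
            zipfile.ZipFile(target_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename.startswith(prefix) and not info.is_dir():
                # Reuse the original entry metadata (name, timestamp).
                dst.writestr(info, src.read(info))
                copied.append(info.filename)
    return copied


# Hypothetical usage; both file names are placeholders:
#   repackage_entries("emr-hadoop-core.jar", "s3native-extra.jar")
# The resulting jar would then be put on the classpath next to the
# s3distcp jar.
```

Copying only the one class (plus its inner classes, if any) rather than the whole Amazon hadoop-core jar is what avoids the NoSuchMethodError-style version conflicts mentioned earlier in the thread, although the extracted class may still depend on other Amazon-specific classes.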

Thanks & Regards
Himanish