

Himanish Kushary 2013-03-29, 14:57
David Parks 2013-03-31, 01:26
Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
I was able to transfer the data to S3 successfully with the workaround
mentioned earlier. I was also able to max out our available upload
bandwidth; I got around 10 MB/s on average from the cluster.
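
For reference, the throughput figures discussed in this thread can be
reproduced with a little arithmetic. The Python sketch below is not from the
original posters. It assumes binary units (1 TB = 1024 * 1024 MB), which is
the reading that gives the "around 583 MB/s" figure quoted further down, and
the function names are made up for illustration.

```python
# Back-of-the-envelope throughput figures from this thread.
# Assumption: binary units (1 TB = 1024 * 1024 MB), which reproduces the
# "around 583 MB/s" figure quoted in the thread.

MB_PER_TB = 1024 * 1024


def throughput_mb_per_s(size_tb: float, minutes: float) -> float:
    """Average throughput in MB/s for moving size_tb terabytes in the given minutes."""
    return size_tb * MB_PER_TB / (minutes * 60)


def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/s."""
    return size_tb * MB_PER_TB / rate_mb_per_s / 3600


if __name__ == "__main__":
    # Dave's two recollections: 45 minutes, or possibly 1.5 hours.
    for minutes in (45, 90):
        rate = throughput_mb_per_s(1.5, minutes)
        print(f"1.5 TB in {minutes} min: {rate:.1f} MB/s aggregate")

    # The two upload rates reported from the local CDH4 cluster.
    for rate in (4, 10):
        print(f"1.5 TB at {rate} MB/s: {transfer_hours(1.5, rate):.1f} hours")
```

At the 10 MB/s reported above, the same 1.5 TB would take roughly 44 hours,
which suggests the limit here is the available upload bandwidth rather than
any s3distcp setting.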

I ran the s3distcp jobs with the default timeout and did not face any
issues.
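
The workaround referred to above (described in the quoted message below) was
to extract the missing class from the Amazon jar and package it into its own
separate jar. A minimal Python sketch of that step follows, relying on the
fact that a jar is a zip archive. The function name and the jar file names in
the usage comment are hypothetical; the thread does not say which tool was
actually used or how the new jar was put on the classpath.

```python
import zipfile

# Class named in the NoClassDefFoundError quoted in this thread.
CLASS_PREFIX = (
    "org/apache/hadoop/fs/s3native/"
    "ProgressableResettableBufferedFileInputStream"
)


def repackage_class(source_jar, target_jar, class_prefix=CLASS_PREFIX):
    """Copy one class (and its inner classes) from source_jar into a new target_jar.

    Returns the list of copied entry names. An empty list means the class
    was not found in source_jar (target_jar is still created, but empty).
    """
    copied = []
    with zipfile.ZipFile(source_jar) as src, \
            zipfile.ZipFile(target_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            is_outer = name == class_prefix + ".class"
            # Inner and anonymous classes compile to Outer$Name.class files.
            is_inner = name.startswith(class_prefix + "$") and name.endswith(".class")
            if is_outer or is_inner:
                dst.writestr(name, src.read(name))
                copied.append(name)
    return copied


# Hypothetical usage (file names are placeholders, not from the thread):
#   repackage_class("hadoop-core-from-emr.jar", "s3native-extra.jar")
```

The same function doubles as a check for whether a given jar contains the
class at all: an empty return value means it does not.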

Thanks all for the help.

Himanish
On Sat, Mar 30, 2013 at 9:26 PM, David Parks <[EMAIL PROTECTED]> wrote:

> 4-20 MB/sec are common transfer rates from S3 to one local AWS box. This
> was, of course, a cluster, and s3distcp is specifically designed to take
> advantage of the cluster, so it was a 45-minute job to transfer the 1.5 TB
> to the full cluster of (I forget how many servers I had at the time) maybe
> 15-30 m1.xlarge. The numbers are rough; I could be mistaken and it was 1½
> hours to do the transfer (but I recall 45 min). In either case the s3distcp
> job ran longer than the task timeout period, which was the real point I was
> focusing on.
>
> I seem to recall needing to re-package their jar as well, but for
> different reasons: they package in some other open source utilities and I
> had version conflicts, so you might want to watch for that.
>
> I’ve never seen this ProgressableResettableBufferedFileInputStream, so I
> can’t offer much more advice on that one.
>
> Good luck! Let us know how it turns out.
>
> Dave
>
> *From:* Himanish Kushary [mailto:[EMAIL PROTECTED]]
> *Sent:* Friday, March 29, 2013 9:57 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs
> for the 1.0.4 branch (I could not find the 1.0.3 APIs, so I used
> http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
> "ProgressableResettableBufferedFileInputStream" class. I am not sure how it
> is present in the hadoop-core.jar in Amazon EMR.
>
> In the meantime I have come up with a dirty workaround: extracting the
> class from the Amazon jar and packaging it into its own separate jar. I am
> now actually able to run s3distcp on local CDH4 using Amazon's jar and
> transfer from my local Hadoop to Amazon S3.
>
> But the real issue is the throughput. You mentioned that you had
> transferred 1.5 TB in 45 mins, which comes to around 583 MB/s. I am barely
> getting 4 MB/s upload speed! How did you get 100x the speed compared to
> me? Could you please share any settings/tweaks that you may have made
> to achieve this? Were you on some very specific high-bandwidth network?
> Was it between HDFS on EC2 and Amazon S3?
>
> Looking forward to hearing from you.
>
> Thanks
>
> Himanish
>
> On Fri, Mar 29, 2013 at 10:34 AM, David Parks <[EMAIL PROTECTED]>
> wrote:
>
> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
> it primarily with 1.0.3, which is what AWS uses, so I presume that's what
> it's tested on.
>
> Himanish Kushary <[EMAIL PROTECTED]> wrote:
>
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the error
> below, which made me think that this is something specific to the Amazon
> Hadoop distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it
> is not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue? One option could be
> using the Amazon-specific jars, but then I would probably need to get all
> the jars (otherwise it could cause version mismatch errors for HDFS,
> such as NoSuchMethodError).
>
> I appreciate your help with this.
>
> - Himanish
>
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[EMAIL PROTECTED]>

Thanks & Regards
Himanish