Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
Yes, you are right, CDH4 is on the 2.x line, but I even checked the javadocs
for the 1.0.4 branch (could not find the 1.0.3 APIs, so used
http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find
the "ProgressableResettableBufferedFileInputStream"
class. Not sure how it is present in the hadoop-core.jar in Amazon EMR.

In the meantime I have come up with a dirty workaround by extracting the
class from the Amazon jar and packaging it into its own separate jar. I am
actually able to run s3distcp now on local CDH4 using Amazon's jar and
transfer from my local Hadoop to Amazon S3.
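For anyone hitting the same NoClassDefFoundError, the workaround can be sketched roughly as below. This is a sketch only: the jar paths and the final s3distcp invocation are placeholders for my environment, not exact commands.

```shell
# Sketch of the workaround (paths and jar names are placeholders):
# 1. Extract just the missing class from Amazon's hadoop-core jar.
mkdir extracted && cd extracted
jar xf /path/to/emr/hadoop-core.jar \
  org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream.class

# 2. Repackage that single class into its own small jar.
jar cf progressable-stream.jar org/

# 3. Put it on the classpath when running s3distcp on CDH4, e.g. via -libjars:
hadoop jar s3distcp.jar -libjars progressable-stream.jar \
  --src hdfs:///some/path --dest s3n://some-bucket/some/path
```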

But the real issue is the throughput. You mentioned that you had
transferred 1.5 TB in 45 mins, which comes to around 583 MB/s. I am barely
getting 4 MB/s upload speed!! How did you get over 100x the speed compared to
me? Could you please share any settings/tweaks that you may have done
to achieve this. Were you on some very specific high-bandwidth network?
Was it between HDFS on EC2 and Amazon S3?
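For reference, the arithmetic behind that 583 MB/s figure works out as follows (assuming 1 TB = 1024 GB = 1024^2 MB):

```shell
# 1.5 TB transferred in 45 minutes, expressed as MB/s:
MB=$((1536 * 1024))   # 1.5 TB = 1536 GB = 1,572,864 MB
SECS=$((45 * 60))     # 45 minutes = 2700 seconds
echo $((MB / SECS))   # prints 582, i.e. roughly 583 MB/s
```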

Looking forward to hearing from you.

Thanks
Himanish
On Fri, Mar 29, 2013 at 10:34 AM, David Parks <[EMAIL PROTECTED]>wrote:

> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
> it primarily with 1.0.3, which is what AWS uses, so I presume that's what
> it's tested on.
>
>
> Himanish Kushary <[EMAIL PROTECTED]> wrote:
>
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the below
> error, which made me think that this is something specific to Amazon's Hadoop
> distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class but it
> is not present on the CDH4 (my local env) hadoop jars.
>
> Could you suggest how I could get around this issue? One option could be
> using the Amazon-specific jars, but then I would probably need to get all
> the jars (else it could cause version mismatch errors for HDFS -
> NoSuchMethodError etc.).
>
> Appreciate your help regarding this.
>
> - Himanish
>
>
>
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[EMAIL PROTECTED]>wrote:
>
>> None of that complexity, they distribute the jar publicly (not the
>> source, but the jar). You can just add this to your libjars:
>> s3n://<region>.elasticmapreduce/libs/s3distcp/<latest>/s3distcp.jar
>>
>>
>> No VPN or anything, if you can access the internet you can get to S3.
>>
>> Follow their docs here:
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>>
>> Doesn't matter where your Hadoop instance is running.
>>
>> Here's an example of code/parameters I used to run it from within another
>> Tool; it's a Tool, so it's actually designed to run from the Hadoop command
>> line normally.
>>
>>
>> ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>>     "--src",        "/frugg/image-cache-stage2/",
>>     "--srcPattern", ".*part.*",
>>     "--dest",       "s3n://fruggmapreduce/results-" + env + "/"
>>                         + JobUtils.isoDate + "/output/itemtable/",
>>     "--s3Endpoint", "s3.amazonaws.com" });
>>
>>
>> Watch the "srcPattern", make sure you have that leading `.*`; that one
>> threw me for a loop once.
>>
>> Dave
>>
>>
>> From: Himanish Kushary [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, March 28, 2013 5:51 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>>
>> Hi Dave,
>>
>> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
>> Could you please provide some details on how I could use s3distcp from
>> Amazon to transfer data from our on-premises Hadoop to Amazon S3.

Thanks & Regards
Himanish