HDFS >> mail # user >> Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput


Himanish Kushary 2013-03-28, 03:54
Ted Dunning 2013-03-28, 07:45
David Parks 2013-03-28, 07:56
Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
Hi Dave,

Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
Could you please provide some details on how I could use the s3distcp from
Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't
some kind of VPN be needed between the Amazon EMR instance and our
on-premises Hadoop instance? Did you mean using the jar from Amazon on our
local server?

Thanks
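For context, running Amazon's s3distcp jar from a local (non-EMR) cluster is
usually just a `hadoop jar` invocation. A minimal sketch of assembling that
command follows; the jar path, HDFS path, and bucket name are all hypothetical,
and it assumes the cluster has outbound HTTPS access to S3:

```python
# Sketch of the s3distcp invocation one would run on the local cluster.
# /opt/lib/s3distcp.jar, the HDFS path, and my-bucket are hypothetical.
src = "hdfs:///user/data/export"
dest = "s3n://my-bucket/export"

cmd = ["hadoop", "jar", "/opt/lib/s3distcp.jar", "--src", src, "--dest", dest]

# Printed rather than executed here, since the jar and cluster are not
# available outside the original environment:
print(" ".join(cmd))
```

On a real cluster the list above would be passed to the shell (or to
`subprocess.run`) instead of printed.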

On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[EMAIL PROTECTED]> wrote:

> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10 min timeout period you're running into a problem with.
>
> Dave
>
> *From:* Himanish Kushary [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, March 28, 2013 10:54 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> I am thinking of either transferring individual folders instead of the
> entire 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg
> rate of transfer to S3 seems to be 5 MB/s). Is there any better option to
> increase the throughput for transferring bulk data from HDFS to S3?
> Looking forward to suggestions.
>
> --
> Thanks & Regards
> Himanish
>

--
Thanks & Regards
Himanish
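A quick back-of-the-envelope check of the timeout proposed in the quoted
message, using only the figures from the thread (~70 GB total at an observed
~5 MB/s upload rate):

```python
# Rough sizing for mapred.task.timeout, from the thread's own numbers:
# ~70 GB to transfer at ~5 MB/s to S3.
total_mb = 70 * 1024                           # ~70 GB in MB
rate_mb_per_s = 5

transfer_seconds = total_mb / rate_mb_per_s    # 14336 s
transfer_hours = transfer_seconds / 3600       # just under 4 hours end to end

# mapred.task.timeout is specified in milliseconds; the default is 600,000 ms
# (10 minutes). A 7-hour value leaves headroom even if a single map task
# carried most of the transfer:
proposed_timeout_ms = 7 * 60 * 60 * 1000       # 25,200,000 ms
```

In practice distcp spreads the copy over many map tasks, so each task only
needs to outlast its own slowest files, but the 10-minute default can still be
too tight when uploads to S3 stall without the task reporting progress.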
David Parks 2013-03-29, 05:41
Himanish Kushary 2013-03-29, 13:18
David Parks 2013-03-29, 14:34