HDFS >> mail # user >> Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput


Himanish Kushary 2013-03-28, 03:54
Ted Dunning 2013-03-28, 07:45
David Parks 2013-03-28, 07:56

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
Hi Dave,

Thanks for your reply. Our Hadoop instance is inside our corporate
LAN. Could you please provide some details on how I could use s3distcp
from Amazon to transfer data from our on-premises Hadoop to Amazon S3?
Wouldn't some kind of VPN be needed between the Amazon EMR instance and our
on-premises Hadoop instance? Did you mean using the jar from Amazon on our
local server?

Thanks

On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[EMAIL PROTECTED]> wrote:

> Have you tried using s3distcp from Amazon? I have used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-minute timeout period you're running into.
>
> Dave
>
> *From:* Himanish Kushary [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, March 28, 2013 10:54 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The maximum individual file size is approx. 50 MB.
>
> The distcp MapReduce job keeps failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> I am considering two workarounds: transferring individual folders instead
> of the entire 70 GB at once, or increasing the "mapred.task.timeout"
> parameter to something like 6-7 hours (as the average rate of transfer to
> S3 seems to be 5 MB/s). Is there a better option to increase the
> throughput for transferring bulk data from HDFS to S3? Looking forward to
> your suggestions.
>
> --
> Thanks & Regards
> Himanish
>

--
Thanks & Regards
Himanish
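
The numbers quoted in the question lend themselves to a quick sanity check.
The following is a minimal Python sketch, not something posted in the thread.
It estimates how long the copy would take at the reported 5 MB/s, converts the
proposed 6-7 hour timeout into the milliseconds that mapred.task.timeout
expects, and builds one distcp command per directory, as in the first
workaround. The directory names, the bucket name, and the s3n:// destination
are hypothetical placeholders, and passing the timeout as a -D generic option
is an assumption about how the job would be launched.

```python
# Figures taken from the original question in this thread.
TOTAL_GB = 70        # total data to copy
MAX_FILE_MB = 50     # largest individual file
RATE_MB_PER_S = 5    # reported average transfer rate to S3


def transfer_seconds(size_mb, rate_mb_per_s):
    """Seconds needed to move size_mb at a steady rate_mb_per_s."""
    return size_mb / rate_mb_per_s


def hours_to_millis(hours):
    """Convert hours to milliseconds, the unit of mapred.task.timeout."""
    return int(hours * 60 * 60 * 1000)


def distcp_command(src, dst, timeout_ms):
    """Build (but do not run) a distcp invocation for one directory."""
    return [
        "hadoop", "distcp",
        "-Dmapred.task.timeout=%d" % timeout_ms,  # assumed generic option
        src, dst,
    ]


if __name__ == "__main__":
    total_s = transfer_seconds(TOTAL_GB * 1024, RATE_MB_PER_S)
    file_s = transfer_seconds(MAX_FILE_MB, RATE_MB_PER_S)
    print("whole copy:   %.1f hours" % (total_s / 3600))
    print("largest file: %.0f seconds" % file_s)

    timeout_ms = hours_to_millis(6)
    print("6 hours = %d ms" % timeout_ms)

    # Hypothetical directory and bucket names, for illustration only.
    for name in ["dir01", "dir02"]:
        cmd = distcp_command(
            "hdfs:///data/%s" % name,
            "s3n://example-bucket/data/%s" % name,
            timeout_ms,
        )
        print(" ".join(cmd))
```

At the stated rate the full 70 GB works out to roughly four hours, and a
single 50 MB file to about ten seconds. The commands are only printed, not
executed.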
David Parks 2013-03-29, 05:41
Himanish Kushary 2013-03-29, 13:18
David Parks 2013-03-29, 14:34