Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
the distcp utility. There are around 2200 files distributed over 15
directories. The maximum individual file size is approximately 50 MB.

The distcp MapReduce job keeps failing with this error:

"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
seconds. Killing!"

and in the task attempt logs I can see a lot of INFO messages like:

"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
(java.io.IOException) caught when processing request: Resetting to invalid

As a workaround, I am thinking of either transferring individual folders
instead of the entire 70 GB at once, or increasing the "mapred.task.timeout"
parameter to something like 6-7 hours (as the average transfer rate to S3
seems to be 5 MB/s). Is there a better option to increase the throughput
when transferring bulk data from HDFS to S3?
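
For reference, here is a rough back-of-the-envelope check of the numbers above, as a minimal Python sketch. It assumes 1 GB = 1024 MB and that 5 MB/s is a steady aggregate rate for the whole job, which may not hold in practice. The function names are just for illustration. The only Hadoop-specific fact used is that mapred.task.timeout is given in milliseconds (the 600 seconds in the error corresponds to 600000 ms).

```python
# Rough estimate only; assumes 1 GB = 1024 MB and a steady aggregate rate.
TOTAL_GB = 70        # total data to copy (from the description above)
RATE_MB_PER_S = 5    # observed average transfer rate to S3


def transfer_seconds(total_gb, rate_mb_per_s):
    """Seconds needed to move total_gb at rate_mb_per_s."""
    return total_gb * 1024 / rate_mb_per_s


def hours_to_millis(hours):
    """Convert hours to milliseconds, the unit used by mapred.task.timeout."""
    return int(hours * 60 * 60 * 1000)


seconds = transfer_seconds(TOTAL_GB, RATE_MB_PER_S)
print("estimated total transfer time: %.1f hours" % (seconds / 3600))
print("7-hour timeout in ms: %d" % hours_to_millis(7))
```

At that rate the whole copy would take roughly 4 hours as a single stream, so a 6-7 hour timeout would be longer than the entire transfer.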
Looking forward to your suggestions.
Thanks & Regards