Hadoop >> mail # user >> S3-->HDFS -->MR or S3-->MR ?


S3-->HDFS -->MR or S3-->MR ?
Our raw data is in S3. We need to process it and write the output back to
S3, so there is no need to store either the raw data or the output on HDFS.

But since MR normally works on HDFS, we would have to copy the raw data from
S3 to HDFS and then launch MR.

On the other hand, I found that the hadoop commands work with the S3 file
system naturally, so we could let our MR jobs read directly from S3 and write
directly to S3.

Are there any speed/performance implications? My rough guess is that
accessing S3 directly would save a little, but not much, since both a
separate copy and direct consumption have to go through the same pipe first.
Is that right?
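
To make that guess concrete, here is a rough back-of-the-envelope sketch in Python. All numbers (data size, S3 and HDFS throughput) are made-up assumptions for illustration, not measurements, and the model ignores compute time, the HDFS write during the copy, and output handling. It only captures the idea that both paths pull the data through the S3 pipe once, and the staged path reads it a second time from HDFS.

```python
# Rough model only: every figure below is a hypothetical assumption.

def staged_seconds(data_gb, s3_gb_per_s, hdfs_gb_per_s):
    """S3 --> HDFS --> MR: pull from S3 once, then MR reads the HDFS copy."""
    copy_time = data_gb / s3_gb_per_s      # data crosses the S3 pipe
    reread_time = data_gb / hdfs_gb_per_s  # MR reads the staged copy
    return copy_time + reread_time


def direct_seconds(data_gb, s3_gb_per_s):
    """S3 --> MR: the job reads straight from S3, crossing the pipe once."""
    return data_gb / s3_gb_per_s


if __name__ == "__main__":
    data_gb = 100.0       # hypothetical input size
    s3_gb_per_s = 0.5     # hypothetical aggregate S3 read throughput
    hdfs_gb_per_s = 2.0   # hypothetical aggregate HDFS read throughput

    print("staged:", staged_seconds(data_gb, s3_gb_per_s, hdfs_gb_per_s), "s")
    print("direct:", direct_seconds(data_gb, s3_gb_per_s), "s")
```

Under these assumptions the difference is just the extra HDFS read, which is small whenever HDFS is much faster than the S3 pipe.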

Thanks a lot
Yang