Our raw data is in S3, and we need to process it and dump the output to
so there is no need to store the raw data or the output on HDFS.
But since MR works on HDFS, we need to copy the raw data from S3 to HDFS
and then launch MR.
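
A minimal sketch of that copy-then-run workflow, written as a small Python helper that only builds the two command lines. The bucket, paths, jar name, and job class are hypothetical placeholders, and the `s3n://` scheme assumes the native S3 filesystem is configured with credentials on the cluster.

```python
import shlex


def copy_then_run_commands(s3_input, hdfs_input, hdfs_output, job_jar, job_class):
    """Build the two commands for the copy-then-process workflow.

    All arguments are caller-supplied; nothing here talks to a cluster.
    """
    # Step 1: bulk-copy the raw data from S3 into HDFS.
    distcp = ["hadoop", "distcp", s3_input, hdfs_input]
    # Step 2: run the MR job against the HDFS copy.
    run_job = ["hadoop", "jar", job_jar, job_class, hdfs_input, hdfs_output]
    return [distcp, run_job]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    commands = copy_then_run_commands(
        "s3n://example-bucket/raw",
        "hdfs:///tmp/raw",
        "hdfs:///tmp/out",
        "example-job.jar",
        "com.example.ExampleJob",
    )
    for cmd in commands:
        print(" ".join(shlex.quote(part) for part in cmd))
```
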
On the other hand, I found that hadoop commands work with the S3 file system,
so we could let our MR jobs consume S3 directly and dump the output directly to
Are there any speed/performance implications? My rough guess is that
accessing S3 directly would probably save a little time, but not much,
since either a separate copy or direct consumption has to go through the
same pipe first. Is that right?
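
To make the guess concrete, here is a back-of-the-envelope model in Python. It is not a measurement: the throughput and compute figures are made-up inputs, and it assumes that the separate copy finishes before the job starts, that a direct read from S3 overlaps perfectly with computation, and that writing the output costs the same either way.

```python
def copy_then_run_seconds(size_gb, s3_gb_per_s, hdfs_gb_per_s, compute_s):
    """Copy everything from S3 first, then run the job against HDFS."""
    copy_s = size_gb / s3_gb_per_s  # data crosses the S3 pipe once
    # The job is bound by HDFS read time or compute time, whichever is longer.
    job_s = max(size_gb / hdfs_gb_per_s, compute_s)
    return copy_s + job_s


def direct_seconds(size_gb, s3_gb_per_s, compute_s):
    """Run the job against S3 directly, reading while computing."""
    # The same S3 pipe is crossed once, but overlapped with the computation.
    return max(size_gb / s3_gb_per_s, compute_s)


if __name__ == "__main__":
    # Hypothetical figures: 100 GB input, 0.5 GB/s from S3,
    # 2 GB/s from HDFS, 300 s of pure computation.
    print(copy_then_run_seconds(100, 0.5, 2.0, 300))
    print(direct_seconds(100, 0.5, 300))
```

Under these assumptions the saving depends on how long the job computes relative to how long the S3 transfer takes. If the S3 transfer dominates both paths, the two totals move closer together, which matches the "same pipe" intuition.
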
Thanks a lot
Elliott Clark 2012-06-20, 18:08
Rahul Patodi 2012-06-21, 04:45