Hadoop, mail # user - Using S3 instead of HDFS


Re: Using S3 instead of HDFS
Mark Kerzner 2012-01-18, 16:56
Awesome info, Matt, thank you so much!

Mark

On Wed, Jan 18, 2012 at 10:53 AM, Matt Pouttu-Clarke <
[EMAIL PROTECTED]> wrote:

> I would strongly suggest using this method to read S3 only.
>
> I have had problems with writing large volumes of data to S3 from Hadoop
> using native s3fs.  Supposedly a fix is on the way from Amazon (it is an
> undocumented internal error being thrown).  However, this fix is already two
> months overdue, and we currently have no ETA.
>
> If you want to write data to S3 reliably, you should use the S3 API
> directly and stream data from HDFS into S3.  Just remember that S3
> requires the final size of the data before you start writing, so it is not
> true streaming in that sense.  After you have completed writing your part
> files in your job (writing to HDFS), you can write a map-only job to
> stream the data up into S3 using the S3 API directly.
>
> In no way, shape, or form should S3 currently be considered a
> replacement for HDFS when it comes to writes.  Your jobs will need to be
> modified and customized to write to S3 reliably; there are file size
> limits on writes; and the multi-part upload option does not work correctly
> and randomly throws an internal Amazon error.
>
> You have been warned!
>
> -Matt
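
[For illustration, a minimal sketch, not Matt's actual code, of the per-file
upload step such a map-only job could perform with the AWS SDK for Java. The
bucket, key, class name, and HDFS path are hypothetical placeholders.]

  import java.io.InputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import com.amazonaws.auth.BasicAWSCredentials;
  import com.amazonaws.services.s3.AmazonS3Client;
  import com.amazonaws.services.s3.model.ObjectMetadata;

  public class HdfsPartToS3 {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path part = new Path("/job/output/part-00000");  // hypothetical part file

      // S3 wants the final object size before the write starts,
      // so take it from the HDFS file status first.
      FileStatus status = fs.getFileStatus(part);
      ObjectMetadata meta = new ObjectMetadata();
      meta.setContentLength(status.getLen());

      AmazonS3Client s3 = new AmazonS3Client(
          new BasicAWSCredentials("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"));  // placeholders
      InputStream in = fs.open(part);
      try {
        // single PutObject call, avoiding the multi-part upload path Matt warns about
        s3.putObject("my-bucket", "job/output/part-00000", in, meta);
      } finally {
        in.close();
      }
    }
  }

[In an actual map-only job, the same logic would sit in the mapper, one part
file per task.]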
>
> On 1/18/12 9:37 AM, "Mark Kerzner" <[EMAIL PROTECTED]> wrote:
>
> >It worked, thank you, Harsh.
> >
> >Mark
> >
> >On Wed, Jan 18, 2012 at 1:16 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >
> >> Ah, sorry about missing that. Settings would go in core-site.xml
> >> (hdfs-site.xml will no longer be relevant once you switch to using
> >> S3).
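
[A minimal core-site.xml sketch of what Harsh describes; the bucket name and
credentials are placeholders. Note that the fs.s3.* property names mentioned
later in the thread apply to the s3:// block-store scheme, while s3n:// URIs
like Mark's use the fs.s3n.* equivalents shown here.]

  <property>
    <name>fs.default.name</name>
    <value>s3n://my-bucket</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>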
> >>
> >> On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote:
> >>
> >> > That wiki page mentions hadoop-site.xml, but that is the old name; now
> >> > you have core-site.xml and hdfs-site.xml, so which one do you put it
> >> > in?
> >> >
> >> > Thank you (and good night Central Time:)
> >> >
> >> > mark
> >> >
> >> > On Wed, Jan 18, 2012 at 12:52 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> When using S3 you do not need to run any component of HDFS at all. It
> >> >> is meant to be an alternate FS choice. You need to run only MR.
> >> >>
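
[A sketch of what "run only MR" could look like on a packaged 0.20 install
like the one in Mark's logs. The service names assume CDH-style init scripts
under /usr/lib/hadoop-0.20; adjust to your distribution.]

  # stop/disable the HDFS daemons; keep only the MapReduce ones
  sudo service hadoop-0.20-namenode stop
  sudo service hadoop-0.20-secondarynamenode stop
  sudo service hadoop-0.20-datanode stop
  sudo service hadoop-0.20-jobtracker start
  sudo service hadoop-0.20-tasktracker start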
> >> >> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions
> >> >> how to go about specifying your auth details to S3, either directly
> >> >> via the fs.default.name URI or via the additional properties
> >> >> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work
> >> >> for you?
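
[Concretely, the two options Harsh names look roughly like this; the bucket
and credentials are placeholders. A secret key containing a "/" is known to
break the embedded-URI form, so the properties route is generally safer.]

  # option 1: auth details embedded directly in the URI
  hadoop fs -ls s3n://ACCESS_KEY_ID:SECRET_ACCESS_KEY@my-bucket/

  # option 2: auth details in core-site.xml (see the snippet earlier
  # in the thread), with a plain bucket URI
  hadoop fs -ls s3n://my-bucket/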
> >> >>
> >> >> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner <
> >> [EMAIL PROTECTED]>
> >> >> wrote:
> >> >>> Well, here is my error message
> >> >>>
> >> >>> Starting Hadoop namenode daemon: starting namenode, logging to
> >> >>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
> >> >>> ERROR. Could not start Hadoop namenode daemon
> >> >>> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
> >> >>> logging to
> >> >>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
> >> >>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI
> >> >>> for NameNode address (check fs.default.name): s3n://myname.testdata is not
> >> >>> of scheme 'hdfs'.
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)