

Re: Using S3 instead of HDFS
Awesome info, Matt, thank you so much!

Mark

On Wed, Jan 18, 2012 at 10:53 AM, Matt Pouttu-Clarke <[EMAIL PROTECTED]> wrote:

> I would strongly suggest using this method only for reading from S3.
>
> I have had problems with writing large volumes of data to S3 from Hadoop
> using the native s3fs.  Supposedly a fix is on the way from Amazon (it is an
> undocumented internal error being thrown).  However, this fix is already 2
> months later than we expected, and we currently have no ETA.
>
> If you want to write data to S3 reliably, you should use the S3 API
> directly and stream data from HDFS into S3.  Just remember that S3
> requires the final size of the data before you start writing, so it is
> not true streaming in that sense.  After you have completed writing your
> part files in your job (writing to HDFS), you can write a map-only job to
> stream the data up into S3 using the S3 API directly.
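A minimal sketch of such a map-only uploader, assuming the AWS SDK for Java is on the job classpath; the job's input is a list of finished HDFS part-file paths (one per line), and the bucket/credential property names are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class HdfsToS3Mapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private AmazonS3 s3;
  private String bucket;

  @Override
  protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    // Hypothetical property names; pass these in with -D on job submission.
    s3 = new AmazonS3Client(new BasicAWSCredentials(
        conf.get("upload.s3.accessKey"), conf.get("upload.s3.secretKey")));
    bucket = conf.get("upload.s3.bucket");
  }

  @Override
  protected void map(LongWritable offset, Text hdfsPath, Context context)
      throws IOException, InterruptedException {
    Path src = new Path(hdfsPath.toString());
    FileSystem fs = src.getFileSystem(context.getConfiguration());

    // S3 needs the object's final size before the write begins, so pass
    // the HDFS file length in the metadata and stream the bytes directly.
    ObjectMetadata meta = new ObjectMetadata();
    meta.setContentLength(fs.getFileStatus(src).getLen());

    // Key scheme (bare file name) is a placeholder; pick one that avoids
    // collisions across directories.
    s3.putObject(new PutObjectRequest(bucket, src.getName(), fs.open(src), meta));
  }
}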
>
> In no way, shape, or form should S3 currently be considered a
> replacement for HDFS when it comes to writes.  Your jobs will need to be
> modified and customized to write to S3 reliably, there are file size
> limits on writes, and the multi-part upload option does not work correctly
> and randomly throws an internal Amazon error.
>
> You have been warned!
>
> -Matt
>
> On 1/18/12 9:37 AM, "Mark Kerzner" <[EMAIL PROTECTED]> wrote:
>
> >It worked, thank you, Harsh.
> >
> >Mark
> >
> >On Wed, Jan 18, 2012 at 1:16 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >
> >> Ah, sorry about missing that. Settings would go in core-site.xml
> >> (hdfs-site.xml will no longer be relevant once you switch to using S3).
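For reference, a minimal core-site.xml along those lines, with a placeholder bucket name; note that the fs.s3n.* keys apply to the native s3n:// filesystem, while the fs.s3.* pair mentioned below applies to the s3:// block-store filesystem:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3n://mybucket</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>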
> >>
> >> On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote:
> >>
> >> > That wiki page mentions hadoop-site.xml, but that is outdated; now you
> >> > have core-site.xml and hdfs-site.xml, so which one do you put it in?
> >> >
> >> > Thank you (and good night Central Time:)
> >> >
> >> > mark
> >> >
> >> > On Wed, Jan 18, 2012 at 12:52 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> >> >
> >> >> When using S3 you do not need to run any component of HDFS at all. It
> >> >> is meant to be an alternate FS choice. You need to run only MR.
> >> >>
> >> >> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 explains
> >> >> how to go about specifying your auth details to S3, either directly
> >> >> via the fs.default.name URI or via the additional properties
> >> >> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work
> >> >> for you?
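For the alternative the wiki describes, the credentials can instead be embedded in the fs.default.name URI itself (placeholder values; a secret key containing a "/" is known to cause trouble in this form):

<property>
  <name>fs.default.name</name>
  <value>s3n://YOUR_ACCESS_KEY_ID:YOUR_SECRET_ACCESS_KEY@mybucket</value>
</property>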
> >> >>
> >> >> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
> >> >>> Well, here is my error message
> >> >>>
> >> >>> Starting Hadoop namenode daemon: starting namenode, logging to
> >> >>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
> >> >>> ERROR. Could not start Hadoop namenode daemon
> >> >>> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
> >> >>> logging to
> >> >>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
> >> >>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI
> >> >>> for NameNode address (check fs.default.name): s3n://myname.testdata is not
> >> >>> of scheme 'hdfs'.
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
> >> >>>       at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)