On Tue, Jun 29, 2010 at 2:57 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> elton sky wrote:
>> thanx Jeff,
>> So...it is a significant drawback.
>> As a matter of fact, there are many cases we need to modify.
> When people say "Hadoop filesystems are not POSIX", this is what they mean:
> no locks, no read/write, seeking discouraged. Even append is something that
> is just stabilising. To be fair though, even NFS is quirky and that's been
> around since Ether-net was considered so cutting edge it had a hyphen in the
> HDFS delivers availability through redundant copies across multiple
> machines: you can read your data on or near any machine with a copy of the
> data. Think what you'd need for full seek and read/write actions:
> * seek would kill bulk IO perf on classic rotating-disk HDDs, and nobody
> can afford to build a petabyte filestore out of SSDs yet. You should be
> streaming, not seeking.
> * to do writes, you'd need to lock out access to the files, which implies a
> distributed lock infrastructure (zookeeper?) or deal with conflicting
> * if you want immediate update writes you'd need to push out the changes to
> the (existing) nodes, and deal with queueing up pending changes to machines
> that are currently offline in a way that I don't want to think about.
> * if you want slower-update writes (eventual consistency), then things may
> be slightly simpler: you'd need a lock on writing, and each write would
> eventually be pushed out to the readers with a bit better bandwidth and CPU
> scheduling flexibility, but there's still that offline node problem. If a
> node that was down comes up, how does it know its data is out of date, and
> where does it get the data from? What will it do if all other nodes that
> have updated data are offline?
> > I don't understand why Yahoo didn't provide that functionality. And as I
> > no one else is working on this. Why is that?
> It's because it scares us and we are happier writing code to live in a
> world where you don't seek and patch files, but instead add new data and
> delete old stuff. I don't know what the Cassandra and HBase teams do here.
It's my understanding that HBase stores datasets in reasonably small files
(a few hundred MB each?), with deltas to a section written to additional
files that are stacked on top of the older ones. Every so often a garbage
collection routine compacts the stack of files for a slice of the table down
into a single file for that slice.
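
To make the stacking idea concrete, here is a minimal sketch in Python. It is
an illustration of the general technique only, not HBase's actual code or API:
the Slice class, its method names and the TOMBSTONE marker are all made up for
this example, and plain dicts stand in for the immutable on-disk files.

```python
# Hypothetical illustration only: not HBase's real storage code or API.
# Plain dicts stand in for immutable files on disk.

TOMBSTONE = object()  # marks a key as deleted by a newer file


class Slice:
    """One slice of a table, kept as a stack of immutable 'files'."""

    def __init__(self):
        self.files = []  # oldest first, newest last

    def flush(self, delta):
        """Write a batch of changes as a new file on top of the stack."""
        self.files.append(dict(delta))

    def get(self, key):
        """Check the newest file first, then fall back to older ones."""
        for f in reversed(self.files):
            if key in f:
                value = f[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        """Merge the stack into one file, dropping stale and deleted entries."""
        merged = {}
        for f in self.files:  # oldest to newest, so newer values win
            merged.update(f)
        self.files = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]


s = Slice()
s.flush({"row1": "a", "row2": "b"})  # original data
s.flush({"row1": "a2"})              # "modify" row1 by adding a newer file
s.flush({"row2": TOMBSTONE})         # "delete" row2 the same way
before = (s.get("row1"), s.get("row2"), len(s.files))
s.compact()
after = (s.get("row1"), s.get("row2"), len(s.files))
```

Nothing is ever modified in place: an update or delete is just another file on
the stack, and compaction rewrites the stack into a fresh file, which fits the
add-new-data, delete-old-stuff model described above.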
And yes, neither Yahoo nor anybody else, for that matter, has provided this
functionality, because it's a lot trickier to add true overwrite capability
in a distributed environment than one might think at first glance. The
engineering cost of developing modification support has, for all
parties interested thus far, been higher than the cost of working around
this limitation. :)