-
Can we modify files in HDFS?
elton sky 2010-06-29, 00:29
hello everyone,
After some research I found that HDFS only supports creating a new file and appending to an existing file. What if I want to modify some parts of a file that is, say, 2 petabytes? Do I have to remove it and create it again, or is there some alternative way?
-
Re: Can we modify files in HDFS?
Jeff Zhang 2010-06-29, 01:50
You cannot modify a file in HDFS. HDFS is designed for write once, read many times.
-- Best Regards
Jeff Zhang
-
Re: Can we modify files in HDFS?
elton sky 2010-06-29, 04:48
thanx Jeff,
So...it is a significant drawback. As a matter of fact, there are many cases where we need to modify. I don't understand why Yahoo didn't provide that functionality. And as far as I know, no one else is working on this. Why is that?
-
Re: Can we modify files in HDFS?
Todd Lipcon 2010-06-29, 05:02
Hi Elton,
Typically, large data sets are of the sort that continuously grow, and are not edited or amended. For example, a common Hadoop use case is the analysis of log data or other instrumentation from web or application servers. In these cases, files are simply added, but there is no need to go back and change entries.
For the ability to have a more table-like random access storage on top of Hadoop, I would encourage you to look into HBase. It supports random read/write access with low latency.
-Todd
-- Todd Lipcon Software Engineer, Cloudera
-
Re: Can we modify files in HDFS?
Steve Loughran 2010-06-29, 09:57
elton sky wrote:
> thanx Jeff,
>
> So...it is a significant drawback.
> As a matter of fact, there are many cases we need to modify.
When people say "Hadoop filesystems are not POSIX", this is what they mean. No locks, no read/write. Seeking discouraged. Even append is something that is just stabilising. To be fair though, even NFS is quirky, and that's been around since Ether-net was considered so cutting edge it had a hyphen in the title.
HDFS delivers availability through redundant copies across multiple machines: you can read your data on or near any machine with a copy of the data. Think what you'd need for full seek and read/write actions:
* seek would kill bulk IO perf on classic rotating-disk HDDs, and nobody can afford to build a petabyte filestore out of SSDs yet. You should be streaming, not seeking.
* to do writes, you'd need to lock out access to the files, which implies a distributed lock infrastructure (zookeeper?) or deal with conflicting writes.
* if you want immediate update writes you'd need to push out the changes to the (existing) nodes, and deal with queueing up pending changes to machines that are currently offline in a way that I don't want to think about.
* if you want slower-update writes (eventual consistency), then things may be slightly simpler: you'd need a lock on writing, and each write would eventually be pushed out to the readers with a bit better bandwidth and CPU scheduling flexibility, but there's still that offline node problem. If a node that was down comes up, how does it know its data is out of date, and where does it get the data from? What will it do if all other nodes that have updated data are offline?
> I don't understand why Yahoo didn't provide that functionality. And as far as I know
> no one else is working on this. Why is that?
It's because it scares us and we are happier writing code to live in a world where you don't seek and patch files, but instead add new data and delete old stuff. I don't know what the Cassandra and HBase teams do here.
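The "add new data and delete old stuff" approach can be sketched roughly as follows. This is only an illustration on a local filesystem standing in for HDFS, not Hadoop API code: the function name `rewrite_with_patch` and its parameters are hypothetical. It streams the old file sequentially into a new file, overlays the changed bytes on the way through, and then swaps the new file in for the old one, so nothing is ever seeked and patched in place.

```python
import os
import tempfile


def rewrite_with_patch(path, offset, new_bytes, chunk_size=64 * 1024):
    """Write a patched copy of `path` by streaming, then swap it in.

    Hypothetical helper: the original file is only read sequentially and
    the replacement is only written sequentially.
    """
    end = offset + len(new_bytes)
    if offset < 0 or end > os.path.getsize(path):
        raise ValueError("patch must lie inside the existing file")

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    pos = 0  # offset in the source file of the start of the current chunk
    with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            chunk_end = pos + len(chunk)
            # Overlay whatever part of new_bytes overlaps this chunk.
            lo, hi = max(pos, offset), min(chunk_end, end)
            if lo < hi:
                chunk = (chunk[:lo - pos]
                         + new_bytes[lo - offset:hi - offset]
                         + chunk[hi - pos:])
            dst.write(chunk)
            pos = chunk_end
    # "Delete old stuff": the new file takes the old file's name.
    os.replace(tmp_path, path)
```

The cost is proportional to the size of the whole file rather than the size of the change, which is why this is only reasonable when modifications are rare or can be batched.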
-steve
-
Re: Can we modify files in HDFS?
Aaron Kimball 2010-07-05, 07:38
It's my understanding that HBase stores datasets in reasonably small files (a few hundred MB each?), where deltas to a section go into additional files that are stacked on top of the older ones. Every so often a garbage collection routine compacts the stack of files for a slice of the table down into a single file for that slice.
And yes, neither Yahoo nor anybody else has provided this functionality, because it's a lot trickier to add true overwrite capability in a distributed environment than one might think at first glance. The engineering cost of developing modification support has, for all parties interested thus far, been higher than the cost of working around this limitation. :)
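A toy model of that stacking-and-compaction idea, as a minimal sketch only: this is not HBase's implementation or API, and the class `SliceStore` and its methods are made-up names. Each batch of changes becomes a new immutable layer, reads consult the newest layer first, and compaction merges the stack into a single layer. Deletes (which would need some kind of marker) are left out.

```python
class SliceStore:
    """Toy model of one table slice: immutable layers plus compaction."""

    def __init__(self):
        # Oldest layer first. A layer is never edited after it is added.
        self.layers = []

    def write_batch(self, updates):
        # A "modification" is just a new layer stacked on top of the old ones.
        self.layers.append(dict(updates))

    def read(self, key):
        # The newest layer wins, so search from the top of the stack down.
        for layer in reversed(self.layers):
            if key in layer:
                return layer[key]
        return None

    def compact(self):
        # Merge oldest to newest so later writes override earlier ones,
        # then replace the whole stack with the single merged layer.
        merged = {}
        for layer in self.layers:
            merged.update(layer)
        self.layers = [merged]


store = SliceStore()
store.write_batch({"row1": "a", "row2": "b"})
store.write_batch({"row2": "B"})   # overrides row2 without touching layer 0
store.compact()                    # one layer left, same visible contents
```

Reads return the same values before and after `compact()`; compaction only reduces how many layers a read has to look through.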