HDFS, mail # dev - [DISCUSS] Remove append?


Re: [DISCUSS] Remove append?
Konstantin Shvachko 2012-03-22, 08:26
Eli,

I went over the entire discussion on the topic, and did not get it. Is
there a problem with append? We know it does not work in hadoop-1,
only flush() does. Is there anything wrong with the new append
(HDFS-265)? If so please file a bug.
I tested it in the Hadoop-0.22 branch and it works fine.

I agree with people who were involved with the implementation of the
new append that the complexity is mainly in
1. pipeline recovery
2. consistent client reading while writing, and
3. hflush()
Once that is done, the append itself, which is the reopening of
previously closed files for adding data, is not complex.

You mentioned it and I agree you indeed should be more involved with
your customer base. As for eBay, append was one of the motivations to
work on stabilizing the 0.22 branch. And there are a lot of use cases
which require append for our customers.
Some of them were mentioned in this discussion.

Thanks,
--Konstantin
On Tue, Mar 20, 2012 at 5:37 PM, Eli Collins <[EMAIL PROTECTED]> wrote:
> Hey gang,
>
> I'd like to get people's thoughts on the following proposal. I think
> we should consider removing append from HDFS.
>
> Where we are today.. append was added in the 0.17-19 releases
> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
> issues. It and sync were re-designed, re-implemented, and shipped in
> 0.21.0 (HDFS-265). To my knowledge, there has been no real production
> use. Anecdotally, people who worked on branch-20-append have told me
> they think the new trunk code is substantially less well-tested than
> the branch-20-append code (at least for sync; append was never well
> tested). It has certainly gotten way less pounding from HBase users.
> The design, however, is much improved, and people think we can get
> hsync (and append) stabilized in trunk (mostly testing and bug
> fixing).
>
> Rationale follows..
>
> Append does not seem to be an important requirement, hflush was. There
> has not been much demand for append, from users or downstream
> projects. Hadoop 1.x does not have a working append implementation
> (see HDFS-3120; the branch-20-append work was focused on sync, not on
> getting append working), and what it has is not enabled by default.
> Because downstream projects will want to support Hadoop 1.x releases
> for years, most will not introduce dependencies on append anyway. This is
> not to say demand does not exist, just that if it does, it's been much
> smaller than security, sync, HA, backwards compatible RPC, etc. This
> probably explains why, over 5 years after the original implementation
> started, we don't have a stable release with append.
>
> Append introduces non-trivial design and code complexity, which is not
> worth the cost if we don't have real users. Removing append means we
> have the property that HDFS blocks, when finalized, are immutable.
> This significantly simplifies the design and code, which significantly
> simplifies the implementation of other features like snapshots,
> HDFS-level caching, dedupe, etc.
>
> The vast majority of the HDFS-265 effort is still leveraged w/o
> append. The new data durability and read consistency behavior was the
> key part.
>
> GFS, which HDFS' design is based on, has append (and atomic record
> append) so obviously a workable design does not preclude append.
> However we also should not ape the GFS feature set simply because it
> exists. I've had conversations with people who worked on GFS that
> regret adding record append (see also
> http://queue.acm.org/detail.cfm?id=1594206). In short, unless append
> is a real priority for our users I think we should focus our energy
> elsewhere.
>
> Thanks,
> Eli