Flume, mail # dev - [DISCUSS] Feature bloat and contrib module

Re: [DISCUSS] Feature bloat and contrib module
Mike Percy 2013-12-21, 00:01
On Mon, Dec 16, 2013 at 11:34 PM, Bruno Mahé <[EMAIL PROTECTED]> wrote:
> Summarizing my suggestions:
> * Committers are not the sole developers. There is no reason for committers
> to take all these responsibilities on their shoulders. Also developer !=
> committer.
> * Easy IN, Easy OUT. If no one volunteers to maintain something, then
> there is no reason to keep it since the community is not interested in it
> anyway.
> * Easy to get in means more contributions and more contributors. Also a
> way to grow community and have contributors becoming full committers. It is
> more than likely they will notice things that can be improved elsewhere and
> start being more active overall.
> * Easy to get out means only the maintained stuff stays. Stuff would most
> likely get kicked out before a feature release (ex: 1.5 vs 1.6). Bug fix
> releases have no reason to kick out components since they are unlikely to
> break in between bug fix releases (ex: 1.5.2 vs 1.5.3).
> * Spreading sources and sinks is going to be quite hard on users. This
> would mean users would have to be developers themselves since they would
> have to:
>     - Find the source/sink on some random repository which may or may not
> be maintained. Pick one of the repositories out of all the ones the user has
> found
>     - Build it against their own version of Apache Flume (Apache, CDH,
> PHD, HDP...)
>     - Resolve dependencies and build issues between their version of
> Apache Flume and source/sink since the source/sink may or may not have been
> maintained
>     - Qualify the integration between their version of Apache Flume and
> source/sink
> * Spreading sources and sinks is going to be quite hard on developers. Why
> should I target Apache Flume when I can just target my version of Flume
> (CDH, PHD, HDP) ?
> * Spreading sources and sinks is going to be quite hard on integrators
> such as Apache Bigtop. This would mean working with as many people as
> there are source/sinks, each with their own way of working and
> schedules.

Hey Bruno, great to hear from you on this list!

Good points, and in principle, I mostly agree with what you are saying, but
I have concerns about some of the proposed approaches. Specifically:

> So why not just remove features or parts that are not maintained?
> Being more aggressive in removing unmaintained parts would enable Apache
> Flume to be more inclusive with regards to contributions.

Removing stuff breaks back-compat and it is hard to know who is using a
component. If just one person is using something, is it worth keeping?
Where do we draw the line? That said, I am not against removing
stuff that made it into a release (after marking it @Deprecated for a
release) if we have consensus among committers that it needs to go.
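
To make the deprecate-then-remove cycle concrete, here is a minimal sketch
of a component that is marked @Deprecated for one release and can be
detected at startup so a warning is logged before it goes away. LegacySink,
DeprecationSketch and isDeprecated are hypothetical names for illustration;
none of them are real Flume classes or APIs.

```java
public class DeprecationSketch {

    /**
     * Hypothetical component, not a real Flume sink.
     *
     * @deprecated Unmaintained; a candidate for removal in the next feature release.
     */
    @Deprecated
    public static class LegacySink {
        public String describe() {
            return "legacy";
        }
    }

    // @Deprecated has runtime retention, so it can be checked reflectively,
    // for example when an agent instantiates the components named in its config.
    public static boolean isDeprecated(Class<?> component) {
        return component.isAnnotationPresent(Deprecated.class);
    }

    public static void main(String[] args) {
        if (isDeprecated(LegacySink.class)) {
            System.err.println("WARN: " + LegacySink.class.getSimpleName()
                    + " is deprecated and may be removed in a future release");
        }
    }
}
```

A startup warning like this would at least give anyone still using the
component one release's notice, which helps with the "who is using it"
problem.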

>>> As another dimension to this discussion, I think there is a limit to the
>>> number of dependencies Flume can reasonably pull in and keep straight
>>> without shading or classloading tricks, which themselves add another layer
>>> of pain/difficulty to debugging.
> This does not completely solve that problem but is somewhat related: what
> about moving all the current sources and sinks as plugins?
> So the core remains lean with all its dependencies in lib/ and all the
> sources and sinks specific libs end up in plugins.d/<plugin>/libext.
> This would be more in the context of Apache Bigtop and packages, but that
> would enable people to pick and choose their dependencies. For instance
> doing a "yum install flume-ng-hdfs flume-ng-redis flume-ng-agent".
> Right now I don't really care about the hdfs sink, but I end up having to
> download a bunch of hdfs related packages that are not really needed.

Well... that actually doesn't solve the dependency problem at all. It
pushes the requirement of knowledge of what works with what to the
end-user. And this type of thing (JAR incompatibility) is nearly impossible
to detect automatically, so we are back to end-users sifting through poms,
Java API docs, and release notes - which is what they would have to do with
a Github project anyway. But now it's for *everything* related to Flume. So
we just made the Flume plugin compatibility situation much worse than it
already was.

Right now, every plugin that ships with Flume can be run in the same JVM
process as every other plugin, with the exception (much to my regret) of
Solr and ElasticSearch. I am loath to add anything else to that "landmine
list". In my view, we need to come up with a technical solution to that
problem before we decide to open the floodgates to any and all plugins /
dependencies, regardless of the plugin acceptance / maintainability
discussion (the two are orthogonal concerns). Which is why I brought up the
possibility of classloading, or OSGI, or something that attempts to solve
this problem. It's not rocket science (all servlet containers do this), but
it's added implementation / debugging complexity for sure and someone has
to do the work to implement it (if we agree that is the right solution to
the problem here).
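
For illustration only, here is a minimal sketch of the classloading idea:
one class loader per plugin directory, so that two plugins can carry
different versions of the same JAR without seeing each other's copy. It
assumes a plugins.d/<plugin>/ layout with lib and libext subdirectories;
PluginLoaderSketch, jarUrls and loaderFor are hypothetical names, not
existing Flume code.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class PluginLoaderSketch {

    // Collects every .jar under <pluginDir>/lib and <pluginDir>/libext.
    // The two subdirectory names are an assumption about the plugin layout.
    public static URL[] jarUrls(File pluginDir) throws MalformedURLException {
        List<URL> urls = new ArrayList<>();
        for (String sub : new String[] {"lib", "libext"}) {
            File[] jars = new File(pluginDir, sub)
                    .listFiles((dir, name) -> name.endsWith(".jar"));
            if (jars == null) {
                continue; // subdirectory does not exist
            }
            for (File jar : jars) {
                urls.add(jar.toURI().toURL());
            }
        }
        return urls.toArray(new URL[0]);
    }

    // Builds a separate loader for one plugin. The core loader is the parent,
    // so core API types are shared while sibling plugins stay invisible to
    // each other.
    public static ClassLoader loaderFor(File pluginDir, ClassLoader coreLoader)
            throws MalformedURLException {
        return new URLClassLoader(jarUrls(pluginDir), coreLoader);
    }
}
```

Note that URLClassLoader delegates to its parent first, so this only
separates plugins from each other. A plugin that needs a different version
of a library already in the core lib/ would still get the core's copy;
handling that takes a child-first loader of the kind servlet containers
use, and that is where the extra implementation and debugging complexity
comes in.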

TL;DR: I don't think the conflicting-dependencies issue has a "project
policy" or packaging solution.