Flume >> mail # dev >> Regarding the adding of additional sinks/sources for various DB's

Re: Regarding the adding of additional sinks/sources for various DB's
On 11/25/2013 12:14 AM, Juhani Connolly wrote:
> Hey guys,
> What I write here is all just my personal opinion and I'm writing in
> hopes of starting a discussion and/or getting feedback. I know I've not
> been very active on the project recently (due to other engagements) but
> do still want it to succeed and hope to find more time for it eventually.
> Right now I see new/active issues for the addition of Redis and Kafka
> sinks, and while they're nice features, I'm personally concerned about
> feature bloat of the project. There are dozens of interceptors, sinks
> and sources that can be thought of, but most of them are very specific
> to a specific use-case.
> Every time we add a new component we're also committing to maintaining
> it over future releases, even if the original contributor gets too busy
> for it. The more such components get added, the more we will get
> distracted from improving core features and getting rid of issues
> affecting them.
> For these reasons I generally haven't submitted components we developed
> for internal use (because they are too specific to our use cases), just
> passing back fixes that fix bugs or apply to the core project.
> For these reasons I think we may want to consider either a) being more
> selective regarding additional component submissions or b) making a
> contrib directory in the project which includes the components but
> doesn't guarantee ongoing maintenance or compatibility.
> On the flip side of course, taking approaches like this may discourage
> new contributors and could thus be considered a negative, and if many
> people feel this way they should definitely share their thoughts.
> I'd be curious to know what others think, and what direction they hope
> to take the project in the future.

I should probably chime in since I submitted the patch for the Redis sink.

I see the arguments about keeping Apache Flume lean, but I am not sure
their benefits outweigh their costs.

As a user, having Apache Flume support multiple sources and sinks
is a big plus. Having to shop around for various sources/sinks is more
troublesome since I first have to find which flavor of a given sink is
being maintained today, then deal with licenses, incompatibilities,
mismatched versions, upgrades, deployment and unfixed bugs, all while
wondering whether this is even going to work at all.
Knowing a piece of code is in Apache Flume puts my mind at ease since
the license is clear, the CLA is cleared and it has been reviewed. There
may be some expectations regarding its support and quality, but it should
be fine as long as it is clearly stated and labeled (see the contrib idea,
or tagging them with different labels such as "supported" or
"experimental"). This also gives more opportunities for bugs to be fixed,
and therefore for the code to be better maintained, thanks to the wider
audience of Apache Flume in comparison to a random small project on GitHub.
Also as a user, I would have to be fairly technical to use a random
source/sink outside of Apache Flume. I would probably have to build it,
qualify it against my version of Apache Flume, and package it for
deployment. Whereas if it is in Apache Flume, it's either already in the
tarball or already in the package of my favorite Apache Flume distribution.
As a developer, Apache Flume is very flexible since I can pick and
choose most parts. But if I have to write my own source and/or my own
sink, I may be tempted to forgo Apache Flume altogether and write the
rest myself for my specific use case.
But if I get to write a source for my use case, I don't have much
incentive to make it public or to maintain it with the current Apache
Flume version. I just need to ensure it works for my version of Apache
Flume. Everything else is just extra work.
Also in the context of a company, I would rather target my source/sink
to work with one of the vendor-supported versions of Apache Flume, which
may be different from the latest Apache Flume. I would have no incentive to
go through the effort of testing it against upstream Apache Flume. If my
source/sink was in Apache Flume, I would be more interested in
contributing to Apache Flume since I know the changes would trickle down
at some point and make my life easier.

As an Apache Bigtop contributor, having all these projects spread around
scares me. They will all depend on different versions of Apache
Flume, build in different ways, work in different ways and integrate in
their own way. Sending patches upstream will also be troublesome since
we would then have to talk to and work with a lot more people than just
the Apache Flume folks, each with different schedules and ways of
working.
In conclusion, I believe having a diverse set of Source/Sink/Channel may
not be a bad idea. If such a piece is not maintained and no one is willing
to maintain it, then I don't see why it could not be removed.

In order to prevent a source/sink/channel from rotting, besides creating a
contrib area, we could also do the following:
* Tag components based on their known quality and stability
* Be strict about unit tests
* Maybe require some integration tests as well