I'd like to echo many of the comments / discussion points here,
including the extension registry (#3), NAR packs, and mixins. A couple
of additional comments and caveats:
NAR package management:
- Grouping NAR packs based on functionality (Hadoop, RDBMS, etc.) is a
good start, but it still seems like we'd ultimately want an a la carte
capability. An incremental approach might be to
have a simple graphical tool (in the toolkit?) pointing at your NiFi
install and some common repository, where you can add and delete NAR
packs, but also delete individual NARs from your NiFi install. The use
case here is when you download the Hadoop NAR pack for HBase and
related components, but don't want things like the Hive NAR (which I
think is the largest at ~93MB).
- Some NiFi installs will be located on systems that cannot reach any
external repository. When we consider NAR
repositories, we should consider providing a repo-to-go or something
of that sort. At the very least I would think the Extension Registry
itself would support such a thing; the ability to have an Extension
Registry anywhere, not just attached to Bintray or Apache repo HTTP
pages, etc.
- Murphy's Law says as soon as we pick NAR pack boundaries, there will
be components that don't fit well into one or another, or they fit
into more than one. For instance, a user might expect the Spark/Livy
NAR to be in the Hadoop NAR pack but there is no requirement for Spark
or Livy to run on Hadoop. Perhaps with a "Big Data" NAR pack (versus
Hadoop) it would encompass the Hadoop and Spark stuff, but then where
does Cassandra fit in? It certainly handles Big Data, but if there
were a "NoSQL" NAR pack, which should it belong to (or can it be in
both)?
- Because NARs are unpacked before use in NiFi, there are two related
footprints: the footprint of the NARs in the lib/ folder, and the
footprint of the unpacked NARs. The "duplicate JARs" discussion also
segues into a third area, the runtime footprint (including classloader
hierarchies, etc.).
Optimized JARs/classloading:
- Promoting JARs to the lib/ folder because they are common to many
processors is not the right solution IMO. With parent-first
classloaders (which is what NarClassLoaders are), if a NAR needed a
different version of a library, its classloader would find the
parent's version first, which would likely cause issues. We could make the
NarClassLoader self-first (which we might want to do under other
circumstances anyway), but then care would need to be taken to ensure
that shared/API dependencies are indeed "provided".
- I do like the idea of "promotion" though, not just for JAR
deduplication but also for better classloading. Here's an idea for how
we might achieve this. When unpacking NARs, we would do something
similar to a Maven install, where we build up a repository of
artifacts. If two artifacts are the same (we'd likely want to verify
checksums too, not just Maven coordinates), they'd install to the same
place. At the end of NAR unpacking, the repo would contain unique
(de-duplicated) JARs, and each NAR would have a bill-of-materials
(BOM) from which to build its classloader. A possible runtime
improvement on top of that is to build a classloader hierarchy, where
JARs shared by multiple NARs could be in their own classloader, which
would be the parent of the NARs' classloaders. This way, instead of
the same classes loaded into each NAR's classloader, they would only
be loaded once into a shared parent. This "de-dupes" the memory
footprint of the JARs as well. Hopefully the construction of the
classloader graph would not be too computationally intensive, but we
could have a best-effort algorithm rather than an optimal one if that
were an issue.
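
To make the repository idea above a bit more concrete, here is a minimal
sketch in Java. Everything in it is an assumption for illustration:
DedupedJarRepo and its methods are hypothetical names (nothing like this
exists in NiFi today), JARs are keyed by SHA-256 checksum only rather
than by Maven coordinates plus checksum, and the "repository" is an
in-memory map instead of a directory on disk. The NAR names in the usage
note are just labels.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a de-duplicating JAR repository built while unpacking NARs. */
final class DedupedJarRepo {
    // checksum -> JAR content; one entry per unique JAR
    private final Map<String, byte[]> jarsByChecksum = new LinkedHashMap<>();
    // NAR name -> bill of materials (checksums of the JARs that NAR needs)
    private final Map<String, List<String>> bomByNar = new LinkedHashMap<>();

    /** "Installs" one JAR found in a NAR; identical content is stored only once. */
    String install(String narName, byte[] jarBytes) {
        String checksum = sha256Hex(jarBytes);
        jarsByChecksum.putIfAbsent(checksum, jarBytes);
        bomByNar.computeIfAbsent(narName, k -> new ArrayList<>()).add(checksum);
        return checksum;
    }

    /** Number of distinct JARs held after de-duplication. */
    int uniqueJarCount() {
        return jarsByChecksum.size();
    }

    /** The BOM from which a NAR's classloader could be built. */
    List<String> bomFor(String narName) {
        List<String> bom = bomByNar.get(narName);
        return bom == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(new ArrayList<>(bom));
    }

    /** JARs referenced by more than one NAR: candidates for a shared parent classloader. */
    Set<String> sharedJars() {
        Map<String, Integer> narCount = new HashMap<>();
        for (List<String> bom : bomByNar.values()) {
            // Count each JAR at most once per NAR.
            for (String checksum : new HashSet<>(bom)) {
                narCount.merge(checksum, 1, Integer::sum);
            }
        }
        Set<String> shared = new TreeSet<>();
        for (Map.Entry<String, Integer> e : narCount.entrySet()) {
            if (e.getValue() > 1) {
                shared.add(e.getKey());
            }
        }
        return shared;
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

If two NARs (say "nifi-hive-nar" and "nifi-hbase-nar") each install a
byte-identical JAR, uniqueJarCount() counts it once, both BOMs reference
the same checksum, and sharedJars() returns it as a candidate for the
shared parent classloader. Deciding how to group shared JARs into a
classloader hierarchy is the harder part and is left out of this sketch.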
Thoughts? Thanks,
Matt
On Tue, Jan 16, 2018 at 12:52 PM, Kevin Doran <[EMAIL PROTECTED]> wrote:
> Nice discussion on this thread.
>
> I'm also in favor of the long-term solution being publishing extension NARs to an extension registry (#3) and removing them from the NiFi convenience binary.