Re: GSOC: Monitor improvements - draft proposal
Supun,

Thanks for the draft! Some feedback -- hopefully it's useful for your
proposal in addition to giving you a better understanding of how
Accumulo is typically run.

"These servers perform different functionalities"

Actually, most servers in an Accumulo cluster are identical to one
another: most run a TabletServer and, in versions before 1.5, a Logger.
The exceptions are the Master, Monitor, Tracer and GarbageCollector. The
Master, Monitor and GC are typically run on the same node (the Monitor
and GC are rather lightweight). Running a Tracer on every TabletServer
is probably overkill, but, again, it's another lightweight process, so
not outside the realm of possibility.

"Create a JMX API for Monitor to gather statistics"

Any plans to include an example 3rd-party monitor that takes advantage
of the internal change from Thrift to JMX? If so, which? I could see
this being very useful for your own verification and validation, not to
mention for 3rd parties (people other than yourself).
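
Just to make that concrete, here's a rough sketch of what exposing
monitor stats as a standard MBean could look like, using only the JDK's
javax.management. The MonitorStatsMBean name and its attributes are
made up for illustration; they aren't anything that exists in Accumulo
today.

// MonitorStatsMBean.java -- the management interface a 3rd-party tool would see
public interface MonitorStatsMBean {
    long getIngestRate();
    long getQueryRate();
}

// MonitorStats.java -- implementation registered on the platform MBeanServer
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class MonitorStats implements MonitorStatsMBean {
    private volatile long ingestRate;
    private volatile long queryRate;

    @Override public long getIngestRate() { return ingestRate; }
    @Override public long getQueryRate()  { return queryRate; }

    public void update(long ingest, long query) {
        this.ingestRate = ingest;
        this.queryRate = query;
    }

    // Register under a well-known name so jconsole or another collector
    // can look the bean up and read its attributes.
    public static MonitorStats register() throws Exception {
        MonitorStats stats = new MonitorStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(stats,
            new ObjectName("org.apache.accumulo:type=Monitor,name=Stats"));
        return stats;
    }
}

Something along those lines would let a 3rd-party monitor read the same
numbers the web UI shows without touching Thrift at all.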

"Table Graphs"

I'd be rather interested to see how the amount of data being returned
by a TabletServer correlates with query rate. It would be a neat plot to
see how RFile index size and the size of each key-value returned
correspond with query rate. Maybe it would be cool to let users create
composite graphs?
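
For example (purely hypothetical, not an existing monitor API), a
composite "bytes returned per query" series could be derived from two
of the per-TabletServer time series and plotted alongside them:

// Hypothetical helper deriving a composite series from two aligned samples.
public final class CompositeSeries {
    private CompositeSeries() {}

    public static double[] bytesPerQuery(long[] bytesReturned, long[] queryCount) {
        double[] out = new double[Math.min(bytesReturned.length, queryCount.length)];
        for (int i = 0; i < out.length; i++) {
            // Avoid division by zero for intervals with no queries.
            out[i] = queryCount[i] == 0 ? 0.0 : (double) bytesReturned[i] / queryCount[i];
        }
        return out;
    }
}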

"Trace Visualization"

There's not a lot to see here currently: you get some rudimentary
information about how long it took to determine which files to delete
and how long deleting them took (I think). It would be nice to see this
broken down by table, and to include file size and other file metadata.

"Server Status Information"

I remember hearing that someone had done some work to actually pop a
shell in the monitor when authenticated over HTTPS. Another cool feature
might be to provide greater insight into a node (perhaps using JMX calls
that we wouldn't want publicly available) when properly authenticated.
I'm thinking about being able to view the list of running scans on a
node: being able to introspect the actual scan options/data, the ranges
being run, etc.
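
As a sketch of what the authenticated case could look like from the
client side (the ObjectName, the "ActiveScans" attribute, and the
host/port and credentials are all hypothetical):

import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ScanInspector {
    public static void main(String[] args) throws Exception {
        // Placeholder RMI connector address for a TabletServer's JMX port.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://tserver-host:12345/jmxrmi");

        // Username/password handed to the remote connector's authenticator.
        Map<String, Object> env = new HashMap<>();
        env.put(JMXConnector.CREDENTIALS, new String[] {"monitor-user", "secret"});

        try (JMXConnector connector = JMXConnectorFactory.connect(url, env)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            Object scans = conn.getAttribute(
                new ObjectName("org.apache.accumulo:type=TabletServer,name=Scans"),
                "ActiveScans");
            System.out.println("Running scans: " + scans);
        }
    }
}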

"Mock Stats Collector"

I would put money on this paying off in spades as you move forward with
testing.
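
A minimal version of that mock might be nothing more than a
canned-value implementation of whatever collector interface the monitor
ends up depending on (the names below are invented for illustration):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical collector abstraction; the mock lets graph/UI code be
// exercised without a running cluster behind it.
public interface StatsCollector {
    Map<String, Long> collect();
}

class MockStatsCollector implements StatsCollector {
    private final Map<String, Long> canned = new ConcurrentHashMap<>();

    MockStatsCollector() {
        canned.put("ingestRate", 1500L);
        canned.put("queryRate", 300L);
        canned.put("onlineTablets", 42L);
    }

    @Override
    public Map<String, Long> collect() {
        return canned;
    }
}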

Some more high-level things...

* Any thought/preference on the JMX library you would want to use?
* Re: JavaScript, you might want to look at DataTables (jQuery-based),
d3.js, and/or nvd3. Lots of options here, but licensing can be a
concern. Glad you thought about that already.

"Deliverables and Timeline"

I'd try to rethink your timeline a bit; it comes off very waterfall-y
to me. The biggest red flag is having "write documentation" as your last
phase. Speaking from experience, this doesn't work 95% of the time:
something else always comes up or takes longer than expected, and
suddenly you have some code that you just got working and no
documentation. I know it's difficult to create a development schedule
when you're not completely familiar with what will be required of you,
but laying out the work so that you have concrete, measurable results
after each phase will help you and, I believe, make for a much more
realistic schedule (not to mention make it easier for your advisor to
see progress :P).

I hope this helps in one way or another.

- Josh

On 4/29/2013 10:46 AM, Supun Kamburugamuva wrote:
 > Hi all,
 >
 > Here is the draft proposal for the Monitor Improvements project.
 >
 >
 > https://docs.google.com/document/d/1j1YHZJXuzxIrB1udt1RnWZUgZLeo-JX711gEv1l--r8/edit#heading=h.2r66wv56fsz
 >
 > I would really appreciate your feedback.
 >
 > Cheers,
 > Supun..
 >