Writing an iterator that calculates on compaction
Benson Margulies 2012-03-02, 20:59
I am trying to get organized to get my feet wet in using the ability
of Accumulo to compute near the data. I beg your pardon in advance for
the following exercise in laying out what I have in mind and asking
for some pointers -- particularly to examples on the 1.4 branch of
code that I could warp to achieve my nefarious purposes.
So, start with this data model:
ROWID     CF         CQ         V
itemid    'context'  dimension  value
itemid    something else entirely...
In short, for an 'item', there's a sparse feature vector associated
with it (identified by cf='context'), and some other things.
Meanwhile, in another table we have:
clusterid 'items' itemid1 -blank-
clusterid 'items' itemid2 -blank-
In other words, a cluster is a grouping of the items from the first
table, identified by their rowids.
My initial test of my ability to find my way around a brightly lit
room with a flashlight is to calculate the centroids of these
clusters, and store them as an additional CF:
CF='centroid' CQ=dimension V=value
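
To make the centroid arithmetic concrete, here is a minimal sketch in plain Java. It deliberately does not use any Accumulo API: it assumes the 'context' entries for each item in a cluster have already been read into a Map from dimension to value, and the class and method names (CentroidSketch, centroid) are hypothetical. It also assumes that a dimension absent from an item's sparse vector counts as zero.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CentroidSketch {

    /**
     * Averages a list of sparse feature vectors, one per item in the cluster.
     * Assumption: a dimension missing from an item contributes 0 to the mean.
     */
    public static Map<String, Double> centroid(List<Map<String, Double>> items) {
        // Sum the values per dimension across all items.
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Double> item : items) {
            for (Map.Entry<String, Double> e : item.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        // Divide each sum by the number of items in the cluster.
        Map<String, Double> result = new HashMap<>();
        int n = items.size();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue() / n);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> item1 = Map.of("d1", 2.0, "d2", 4.0);
        Map<String, Double> item2 = Map.of("d1", 4.0);
        // Each entry of the result would become one CF='centroid' cell.
        System.out.println(centroid(List.of(item1, item2)));
    }
}
```

Each entry of the returned map corresponds to one cell in the layout above, with the key as CQ and the mean as V.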
And then my second test is to calculate the distance from each item to
the centroid of its cluster, and store that. Finally, I want to
peruse items in descending order of their distance-from-centroid
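
The distance and ordering steps could be sketched as follows, again in plain Java with no Accumulo API involved. The post does not name a distance metric, so Euclidean distance is an assumption here, as are the names DistanceSketch, distance and farthestFirst. Missing dimensions are treated as zero on both sides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistanceSketch {

    /** Euclidean distance between two sparse vectors (assumed metric). */
    public static double distance(Map<String, Double> a, Map<String, Double> b) {
        // Walk the union of dimensions so entries present in only one vector count.
        Set<String> dims = new HashSet<>(a.keySet());
        dims.addAll(b.keySet());
        double sumSq = 0.0;
        for (String d : dims) {
            double diff = a.getOrDefault(d, 0.0) - b.getOrDefault(d, 0.0);
            sumSq += diff * diff;
        }
        return Math.sqrt(sumSq);
    }

    /** Returns item ids ordered by descending distance from the centroid. */
    public static List<String> farthestFirst(Map<String, Map<String, Double>> items,
                                             Map<String, Double> centroid) {
        List<String> ids = new ArrayList<>(items.keySet());
        ids.sort(Comparator
                .comparingDouble((String id) -> distance(items.get(id), centroid))
                .reversed());
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Double> centroid = Map.of("d1", 3.0, "d2", 2.0);
        Map<String, Map<String, Double>> items = Map.of(
                "a", Map.of("d1", 3.0, "d2", 2.0),
                "b", Map.of("d1", 0.0),
                "c", Map.of("d1", 3.0, "d2", 5.0));
        System.out.println(farthestFirst(items, centroid));
    }
}
```

Sorting in memory like this only illustrates the ordering. In a table, one option would be to encode the stored distance into the key so that a scan returns items in the desired order, but that is a design choice the post leaves open.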