Accumulo, mail # user - Writing an iterator that calculates on compaction


Writing an iterator that calculates on compaction
Benson Margulies 2012-03-02, 20:59
Folks,

I am trying to get organized to get my feet wet in using the ability
of accumulo to compute near the data. I beg your pardon in advance for
the following exercise in laying out what I have in mind and asking
for some pointers -- particularly to examples on the 1.4 branch of
code that I could warp to achieve my nefarious purposes.

So, start with this data model:
  ROWID   CF          CQ            V
  itemid  'context'   dimension     value
  itemid  something   else          entirely...

In short, for an 'item', there's a sparse feature vector associated
with it (identified by cf='context'), and some other things.

Meanwhile, in another table we have:

  ROWID      CF       CQ            V
  clusterid  'items'  itemid1       -blank-
  clusterid  'items'  itemid2       -blank-

In other words, a cluster is a grouping of the items from the first
table, identified by their rowids.

My initial test of my ability to find my way around a brightly lit
room with a flashlight is to calculate the centroids of these
clusters, and store them as an additional CF:

    CF='centroid' CQ=dimension V=value
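
To make the centroid arithmetic concrete, here is a minimal sketch in plain
Python, outside Accumulo entirely. It assumes a centroid is the per-dimension
mean of the member items' sparse vectors, with a missing dimension counted as
zero. The dictionaries, the sample values, and the function name `centroid`
are hypothetical stand-ins for the two tables, not anything from the Accumulo
API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two tables.
# Items table: rowid -> {dimension: value}, taken from cf='context'.
item_vectors = {
    "item1": {"d1": 2.0, "d2": 4.0},
    "item2": {"d1": 4.0},
}
# Clusters table: clusterid -> item rowids (cf='items', cq=itemid).
cluster_members = {"clusterA": ["item1", "item2"]}


def centroid(member_ids, vectors):
    """Per-dimension mean of sparse vectors; absent dimensions count as 0."""
    n = len(member_ids)
    if n == 0:
        return {}
    sums = defaultdict(float)
    for item_id in member_ids:
        for dim, value in vectors.get(item_id, {}).items():
            sums[dim] += value
    return {dim: total / n for dim, total in sums.items()}


# One centroid per cluster; each entry would map to
# CF='centroid' CQ=dimension V=value in the cluster's row.
centroids = {
    cid: centroid(ids, item_vectors) for cid, ids in cluster_members.items()
}
```

Because the mean is a sum divided by a count, the per-dimension sums and the
member count can be accumulated incrementally, which is presumably what makes
this a candidate for computing near the data.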

And then my second test is to calculate the distance from each item to
the centroid of its cluster, and store that. Finally, I want to
peruse items in descending order of their distance-from-centroid
values.
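
The second test and the final ordering can be sketched the same way. This
assumes Euclidean distance, which the description above does not actually
specify, and uses hypothetical sample data and a hypothetical function name
`euclidean_distance`. It only illustrates the arithmetic and the sort, not how
an iterator would do it.

```python
import math


def euclidean_distance(vec, center):
    """Euclidean distance between two sparse vectors (missing dims are 0)."""
    dims = set(vec) | set(center)
    return math.sqrt(
        sum((vec.get(d, 0.0) - center.get(d, 0.0)) ** 2 for d in dims)
    )


# Hypothetical items belonging to one cluster, and that cluster's centroid.
item_vectors = {
    "item1": {"d1": 2.0, "d2": 4.0},
    "item2": {"d1": 6.0},
    "item3": {"d1": 3.0, "d2": 2.0},
}
center = {"d1": 3.0, "d2": 2.0}

# Distance of each item from the centroid; this is the value to store.
distances = {
    item_id: euclidean_distance(vec, center)
    for item_id, vec in item_vectors.items()
}

# Items in descending order of distance-from-centroid.
ranked = sorted(distances.items(), key=lambda kv: kv[1], reverse=True)
```

Accumulo returns entries in sorted key order, so reading items back in
descending distance order would presumably require the distance to appear in
a key in some sortable encoding; that encoding is not shown here.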

TIA