On Fri, Mar 2, 2012 at 8:16 PM, Billie J Rinaldi
<[EMAIL PROTECTED]> wrote:
> To calculate the centroid using an iterator, you would need to store
> the data for each item in the cluster table. Iterators only read local
> data. For example,
>> clusterid 'items' itemid1 dimension value
> where you pack 'items' and itemid1 into the CF or itemid1 and
> dimension into the CQ.
> This could allow an iterator to calculate a centroid. If the
> calculation is expensive (the clusters are very large or have
> extremely high dimensionality), you may just want to run a task
> that reads the data and writes out the centroid instead of using
> an iterator for this. It might be most useful to have an iterator
> that adds new data into an existing centroid. Then you would
> need a way of determining what is new.
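Concretely, the per-cluster math such an iterator (or standalone task) would perform might look like the sketch below. This is plain Java with illustrative names only, not the poster's code, and the Accumulo SortedKeyValueIterator plumbing is omitted; it just folds scanned (itemid, dimension, value) entries for one cluster into a centroid.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of centroid computation over rows of the form
//   clusterid | 'items' | itemid + dimension | value
// All names are illustrative; this is not Accumulo API.
public class CentroidSketch {

    /** One scanned entry: which item, which dimension, what weight. */
    record Entry(String itemId, String dimension, double value) {}

    /**
     * Fold all entries for one cluster into a centroid: for each
     * dimension, sum the values, then divide by the number of
     * distinct items in the cluster.
     */
    static Map<String, Double> centroid(List<Entry> clusterEntries) {
        Map<String, Double> sums = new HashMap<>();
        long items = clusterEntries.stream()
                                   .map(Entry::itemId)
                                   .distinct()
                                   .count();
        for (Entry e : clusterEntries) {
            sums.merge(e.dimension(), e.value(), Double::sum);
        }
        if (items > 0) {
            sums.replaceAll((dim, sum) -> sum / items);
        }
        return sums;
    }
}
```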
I see, I'm prematurely optimizing here. That's a side-effect of trying
to think simultaneously about the actual clustering problem and about
using accumulo at all. The number of dimensions per item isn't gigantic;
it's a feature vector representing the context around a name mention
in text. The number of clusters will eventually be large; that's the
reason to think big in the first place. The current implementation
just stuffs all data into a Lucene index. The clustering uses a
combination of vector-space distance on these vectors and also a
separate metric of textual name similarity; the latter starts with a
Lucene query to find candidates and then scores them.
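For the vector-space part of that scoring, a minimal sparse cosine-similarity sketch (plain Java, names hypothetical, not the existing Lucene-based implementation) could look like:

```java
import java.util.Map;

// Cosine similarity between two sparse feature vectors stored as
// dimension -> value maps. Illustrative only.
public class SparseCosine {
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double bv = b.get(e.getKey());       // dimensions absent in b contribute 0
            if (bv != null) dot += e.getValue() * bv;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0)
            ? 0.0
            : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```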
You have indeed identified my eventual reason to think about an
iterator-based solution, which is to be able to ingest new items
incrementally, updating centroids etc. However, I had better
demonstrate that I can port the whole business into accumulo at all
first before I try to make it work incrementally and optimize that.
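If the incremental route ever pans out, the centroid update itself is cheap: a running mean per dimension. A hedged sketch, assuming the centroid also tracks how many items it already contains (all names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Fold one newly ingested item's sparse vector into an existing
// centroid without rescanning the cluster. 'count' is the number of
// items already folded into the centroid.
public class IncrementalCentroid {
    static Map<String, Double> addItem(Map<String, Double> centroid,
                                       long count,
                                       Map<String, Double> item) {
        Map<String, Double> updated = new HashMap<>(centroid);
        long n = count + 1;
        // Existing dimensions move toward the new item's value
        // (or toward 0 if the item lacks that dimension).
        updated.replaceAll((dim, mean) ->
            mean + (item.getOrDefault(dim, 0.0) - mean) / n);
        // Dimensions the centroid has never seen enter as value / n.
        item.forEach((dim, v) -> updated.computeIfAbsent(dim, d -> v / n));
        return updated;
    }
}
```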
It occurs to me that another facet of my ignorance is on display here.
When you wrote 'task' above, were you thinking MR, or a conventional
program talking to the accumulo API, or something else?
Anyway, thanks for continuing to help me think my way around this stuff.
Meanwhile, just for long-term code reading, is one of the iterators in
the 1.4 tree a good model for compaction? I also still have this idea
of experimenting with a Petrovic-style LSH built that way, which I
might try out at some point.
> Once the centroid and data are written in the same table, another
> iterator could calculate the distance to centroid for each itemid. The
> data will have to be resorted to get it arranged by distance, so an
> efficient way to do that might be to run a scan-time iterator and
> write the distance data back to an Accumulo table (possibly the same
> one) using a batch writer or MR output format.
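One hypothetical way to get the written-back distance data "arranged by distance" is to encode the distance into the key as fixed-width, zero-padded decimal text, so Accumulo's lexicographic key order matches numeric order. A sketch (the width/precision choices here are arbitrary assumptions):

```java
import java.util.Locale;

// Encode a distance so that lexicographic string order equals numeric
// order; suitable for use in a row id or column qualifier.
public class DistanceKey {
    static String encode(double distance) {
        // 11 chars total, 6 decimals: e.g. 3.25 -> "0003.250000".
        // Sorts correctly for 0 <= distance <= 9999.999999.
        return String.format(Locale.ROOT, "%011.6f", distance);
    }
}
```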
> ----- Original Message -----
>> From: "Benson Margulies" <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Sent: Friday, March 2, 2012 3:59:34 PM
>> Subject: Writing an iterator that calculates on compaction
>> I am trying to get organized to get my feet wet in using the ability
>> of accumulo to compute near the data. I beg your pardon in advance for
>> the following exercise in laying out what I have in mind and asking
>> for some pointers -- particularly to examples on the 1.4 branch of
>> code that I could warp to achieve my nefarious purposes.
>> So, start with this data model:
>> ROWID CF CQ V
>> itemid 'context' dimension value
>> itemid something else entirely...
>> In short, for an 'item', there's a sparse feature vector associated
>> with it (identified by cf='context'), and some other things.
>> Meanwhile, in another table we have:
>> clusterid 'items' itemid1 -blank-
>> clusterid 'items' itemid2 -blank-
>> In other words, a cluster is a grouping of the items from the first
>> group, identified by their rowids.
>> My initial test of my ability to find my way around a brightly lit
>> room with a flashlight is to calculate the centroids of these